Title: Recursive Variable Expansion: A Loop Transformation for Reconfigurable Systems
1 Recursive Variable Expansion: A Loop Transformation for Reconfigurable Systems
2 Outline
- Introduction
- General Loop Transformations
- Motivational Example
- Goals for the new Technique
- Proposed Transformation
- Suitable Benchmarks
- Future work
3 Introduction
- Loop Parallelization
- Why?
- Problem
- Data dependencies
- Loop Transformation
4 General Loop Transformations
- Derived for parallel computers
- Commonly used loop-level optimizations for reconfigurable computing:
- Loop Unrolling
- Loop Tiling
- Loop Fusion
- Loop Peeling
- Loop Fission
- Loop Reversal
- Loop Interchanging
- Loop Skewing
- Software Pipelining
5 When to apply loop transformation?
- This is the most difficult question to answer.
- The selection and sequence of suitable techniques depend on the type of loop nest and on the dependency relationships.
6 Motivational Example
- For i = 2 to n
- For j = 2 to m
- S1: A[i, j] = A[i-1, j] + c
- S2: B[i, j] = A[i, j] + A[i, j-1]
- End For
- End For
7 Data Dependency
- For i = 2 to n
- For j = 2 to m
- S1: A[i, j] = A[i-1, j] + c
- S2: B[i, j] = A[i, j] + A[i, j-1]
- End For
- End For
Figure: iteration space showing the dependencies among the iterations
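The dependencies in this loop nest can also be enumerated mechanically. The following is a minimal Python sketch, not taken from the slides: the function names `dependence_edges` and `distance_vectors` are hypothetical, and row 1 and column 1 of A are assumed to be inputs. It records which earlier iteration produces each value of A that an iteration reads, and then reduces the edges to distance vectors.

```python
def dependence_edges(n, m):
    """Return (source, sink) iteration pairs for the loop nest.

    Iterations are (i, j) with 2 <= i <= n and 2 <= j <= m. Elements of A
    outside that range (row 1, column 1) are assumed to be inputs.
    """
    written = {(i, j) for i in range(2, n + 1) for j in range(2, m + 1)}
    edges = set()
    for (i, j) in sorted(written):
        # S1 reads A[i-1, j]; S2 reads A[i, j-1].
        # S2 also reads A[i, j], but that is produced in the same iteration.
        for src in ((i - 1, j), (i, j - 1)):
            if src in written:
                edges.add((src, (i, j)))
    return edges


def distance_vectors(edges):
    """Reduce dependence edges to their distance vectors (sink - source)."""
    return {(ti - si, tj - sj) for (si, sj), (ti, tj) in edges}


if __name__ == "__main__":
    print(sorted(distance_vectors(dependence_edges(4, 4))))
```

Both loops carry a dependence (distance vectors (1, 0) and (0, 1)), so neither loop of the original nest can be run in parallel as written.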
8 Loop Skewing
- For I = 2 to m+n
- For J = max(2, I-n) to min(m, I)
- S1: A[I-J, J] = A[I-J-1, J] + c
- S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
- End For
- End For
- The inner loop is parallelizable.
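A small Python sketch can check that a skewed traversal computes the same values as the original nest. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, row 1 and column 1 of A are treated as inputs, and the loop bounds used here (I from 4 to m+n, J up to min(m, I-2)) are derived in this sketch from requiring i = I - J to stay within 2..n.

```python
def make_inputs(n, m):
    """Hypothetical input values for row 1 and column 1 of A."""
    a = {(1, j): j for j in range(1, m + 1)}
    a.update({(i, 1): 10 * i for i in range(2, n + 1)})
    return a


def original(n, m, c, inputs):
    A, B = dict(inputs), {}
    for i in range(2, n + 1):
        for j in range(2, m + 1):
            A[i, j] = A[i - 1, j] + c          # S1
            B[i, j] = A[i, j] + A[i, j - 1]    # S2
    return A, B


def skewed(n, m, c, inputs, reverse_inner=False):
    A, B = dict(inputs), {}
    for I in range(4, n + m + 1):              # wavefront I = i + j
        js = range(max(2, I - n), min(m, I - 2) + 1)
        # All iterations on one wavefront read only values from wavefront
        # I - 1, so the inner loop may run in any order (or in parallel).
        for J in (reversed(js) if reverse_inner else js):
            i = I - J
            A[i, J] = A[i - 1, J] + c          # S1
            B[i, J] = A[i, J] + A[i, J - 1]    # S2
    return A, B


if __name__ == "__main__":
    ref = original(5, 4, 3, make_inputs(5, 4))
    print(skewed(5, 4, 3, make_inputs(5, 4)) == ref)
    print(skewed(5, 4, 3, make_inputs(5, 4), reverse_inner=True) == ref)
```

Running the inner loop in reverse order and still getting the same result is a cheap way to see that the iterations of one wavefront do not depend on each other.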
9 Speed and Area Estimation
For I = 2 to m+n
  For J = max(2, I-n) to min(m, I)
    S1: A[I-J, J] = A[I-J-1, J] + c
    S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  End For
End For
- Assumptions
- Each addition takes one cycle.
- Area is directly proportional to the number of terms to be added.
- Time: if the whole inner loop is executed in parallel, each iteration of the outer I loop takes 2 cycles (S1 followed by S2).
- Total time = 2(m+n-1) cycles
10 Speed and Area Estimation
For I = 2 to m+n
  For J = max(2, I-n) to min(m, I)
    S1: A[I-J, J] = A[I-J-1, J] + c
    S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  End For
End For
- Area: the number of terms to be added in each inner iteration is 4.
- Total area = 4 x the maximum number of inner-loop iterations over all outer-loop iterations.
- The maximum number of terms to be added is 4(min(m, n+2) - 1).
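The time and area figures for the skewed loop can be reproduced by counting iterations. The sketch below is only an illustration: `skewed_estimates` is a hypothetical name, it uses the loop bounds as written on the slide (I from 2 to m+n, J from max(2, I-n) to min(m, I)), and it applies the slide's assumptions of 2 cycles per outer iteration and 4 addition terms per inner iteration. The closed form 4(min(m, n+2) - 1) it is compared against is one reading of the slide's expression.

```python
def skewed_estimates(n, m):
    """Return (total_time_in_cycles, total_area_in_terms) for the skewed loop."""
    outer_iterations = 0
    max_inner = 0
    for I in range(2, m + n + 1):
        # Number of J values in max(2, I-n) .. min(m, I), never negative.
        inner = max(0, min(m, I) - max(2, I - n) + 1)
        outer_iterations += 1
        max_inner = max(max_inner, inner)
    total_time = 2 * outer_iterations   # S1 then S2, inner loop fully parallel
    total_area = 4 * max_inner          # 4 terms per inner iteration
    return total_time, total_area


if __name__ == "__main__":
    n, m = 5, 4
    print(skewed_estimates(n, m))
    print(2 * (m + n - 1), 4 * (min(m, n + 2) - 1))
```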
11 Goals for the new Technique
- How can a part of the code be transformed to extract the maximum amount of parallelism and speed?
- What is the maximum speedup achievable in a program, given abundant resources?
- How can the reconfigurable architecture be used more efficiently?
12 Proposed Transformation
- Assumption: a large amount of resources is available.
- Parallelism is constrained by data dependencies.
- Is it possible to transform the program so that the resulting program is free of data dependencies?
- That would imply that maximum parallelism can be obtained.
- Yes
13 Recursive Variable Expansion
- Basic Idea
- If there are two statements Si and Tj, for some iterations i and j, such that Si δ Tj (Tj depends on Si), then the computation done in Si can be replicated in Tj, which makes Tj independent of Si. Repeat this until Tj is a function only of the inputs or known values.
- As in the example:
- A[4, 3] = A[3, 3] + c
- = A[2, 3] + c + c
- = A[1, 3] + c + c + c
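The expansion in this example can be sketched as a short recursive function. This is a minimal illustration rather than the authors' implementation: `expand_A` is a hypothetical name, terms are kept as plain strings, and row 1 of A is assumed to be a known input.

```python
def expand_A(i, j):
    """Expand A[i, j] = A[i-1, j] + c until only inputs and constants remain.

    Returns the list of terms whose sum equals A[i, j].
    Assumption: row 1 of A (i == 1) is a known input.
    """
    if i == 1:
        return [f"A[1, {j}]"]
    # Replace A[i, j] by the right-hand side of S1 and keep expanding.
    return expand_A(i - 1, j) + ["c"]


if __name__ == "__main__":
    print("A[4, 3] = " + " + ".join(expand_A(4, 3)))
```

After the expansion, A[4, 3] no longer refers to any value computed in another iteration, so it can be evaluated independently of the rest of the loop nest.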
14 Speed and Area Estimate
For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For
- Area
- Statement S1 depends only on i. Let T(i) denote its number of addition terms.
- T(i) = T(i-1) + 1
- Solution: T(i) = i
- Let S(i) denote the number of addition terms in S2: S(i) = 2T(i) = 2i
- The number of addition terms in one iteration is t(i) = S(i) + T(i) = 3i
- The total number of addition terms over all iterations is given by the sum of t(i) = 3i over i = 2..n and j = 2..m.
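The term counts can be checked numerically. The sketch below follows the slide's recurrences directly; the function names, the base case T(1) = 1 (an input counts as one term), and the closed form 3(m-1)(n(n+1)/2 - 1) are assumptions introduced here by summing 3i over i = 2..n and j = 2..m.

```python
def T(i):
    """Addition terms in S1 after expansion: T(i) = T(i-1) + 1, assuming T(1) = 1."""
    return 1 if i == 1 else T(i - 1) + 1


def S(i):
    """Addition terms in S2: two expanded A values, each with T(i) terms."""
    return 2 * T(i)


def t(i):
    """Addition terms in one iteration (S1 and S2 together)."""
    return S(i) + T(i)


def total_terms(n, m):
    """Sum t(i) over the whole iteration space 2 <= i <= n, 2 <= j <= m."""
    return sum(t(i) for i in range(2, n + 1) for _ in range(2, m + 1))


def total_terms_closed(n, m):
    """Closed form of the same sum (derived here, not stated on the slide)."""
    return 3 * (m - 1) * (n * (n + 1) // 2 - 1)


if __name__ == "__main__":
    print(total_terms(4, 3), total_terms_closed(4, 3))
```

The total grows quadratically in n and linearly in m, which is the area cost of expanding every intermediate value.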
15 Speed and Area Computation
For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For
- Is this area always going to be high?
- No, not if the only output of the loop nest (the only variable that is used later) is B[n, m].
- Then there is no need to expand all the intermediate computations on the FPGA: B[n, m] can be computed directly after the expansion, as it is a function of the inputs only. In this case the number of addition terms is S(n) = 2n.
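To see where the 2n comes from, B[n, m] can be expanded on its own. This is a minimal sketch under the same assumptions as the slide's count: `expand_A` and `expand_B` are hypothetical helpers, terms are plain strings, and row 1 of A is taken as input.

```python
def expand_A(i, j):
    """Terms of A[i, j] after expanding S1 down to row 1 (assumed input)."""
    if i == 1:
        return [f"A[1, {j}]"]
    return expand_A(i - 1, j) + ["c"]


def expand_B(i, j):
    """Terms of B[i, j] = A[i, j] + A[i, j-1] with both A values expanded."""
    return expand_A(i, j) + expand_A(i, j - 1)


if __name__ == "__main__":
    n, m = 4, 3
    terms = expand_B(n, m)
    print(f"B[{n}, {m}] = " + " + ".join(terms))
    print(len(terms), "terms")
```

Only the final expression is built in hardware, so the area is driven by the 2n terms of B[n, m] rather than by the whole iteration space.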
16 Speed and Area Estimate
- Speed: to add n terms, add all pairs of consecutive terms in parallel to get partial sums, then repeat the same on the partial sums until only one value is left.
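The pairwise scheme described above can be sketched as a tree reduction. This is an illustrative Python model, not a hardware description: `tree_sum` is a hypothetical name, and each pass of the loop stands for one level of adders that would operate in parallel.

```python
def tree_sum(values):
    """Add a non-empty sequence pairwise, level by level.

    Returns (total, number_of_levels).
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Add consecutive pairs; in hardware these additions run in parallel.
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an odd element passes through unchanged
        level = nxt
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    print(tree_sum(range(1, 10)))
```

The number of levels grows as the base-2 logarithm of the number of terms (rounded up), instead of linearly as in a sequential chain of additions.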
17 Speed and Area Estimate
- Can we exploit the flexibility of the FPGA to improve this design?
- Wallace tree
Figure: CSA (carry-save adder) for 3 inputs of 4 bits each
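A carry-save adder reduces three operands to two without propagating carries. The following bit-level sketch models that behaviour on Python integers; `carry_save_add` is a hypothetical name, and the model ignores word width (Python integers do not overflow).

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (partial_sum, carry) with a + b + c == partial_sum + carry."""
    partial_sum = a ^ b ^ c                       # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, shifted left
    return partial_sum, carry


if __name__ == "__main__":
    s, cy = carry_save_add(0b1011, 0b0110, 0b1101)
    print(s, cy, s + cy == 0b1011 + 0b0110 + 0b1101)
```

Every bit position is handled independently, so the delay of one CSA stage does not depend on the operand width; only the final addition of the two outputs needs carry propagation.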
18 Speed and Area Estimation
Figure: Wallace tree for nine inputs
19 Speed and Area Estimation
- The number of levels required by a Wallace tree to add n inputs is log_1.5(n/2).
- A fast CPA (carry-propagate adder) can be used, which takes log k levels for a k-bit addition.
- As all the computation is done in parallel, the time taken is the time taken by the term with the maximum number of additions, which is S(n) = 2n.
- Assume that one cycle = 8 gate delays.
- Then time = (log_1.5(n) + log k)/8 cycles
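The level count and the time estimate can be illustrated with a small software model of a Wallace tree. This is a sketch under stated assumptions: all function names are hypothetical, the final carry-propagate addition is modelled with Python's `sum`, and the logarithm for the k-bit CPA is taken to base 2, which the slide does not specify.

```python
import math


def carry_save_add(a, b, c):
    """3:2 compressor: a + b + c == partial_sum + carry."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_reduce(values):
    """Reduce operands with CSA levels until two remain; return (total, csa_levels)."""
    operands = list(values)
    levels = 0
    while len(operands) > 2:
        nxt = []
        full = len(operands) - len(operands) % 3
        for k in range(0, full, 3):
            nxt.extend(carry_save_add(*operands[k:k + 3]))   # 3 operands -> 2
        nxt.extend(operands[full:])                          # leftovers pass through
        operands = nxt
        levels += 1
    return sum(operands), levels   # final carry-propagate addition


def estimated_cycles(n, k, gate_delays_per_cycle=8):
    """Slide's estimate (log_1.5(n) + log k) / 8, assuming log k is base 2."""
    return (math.log(n, 1.5) + math.log2(k)) / gate_delays_per_cycle


if __name__ == "__main__":
    total, levels = wallace_reduce(range(1, 10))
    print(total, levels, math.ceil(math.log(9 / 2, 1.5)))
    print(estimated_cycles(n=16, k=32))
```

For nine inputs the model needs four CSA levels (9, 6, 4, 3, 2 operands), which matches log_1.5(9/2) rounded up.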
20 Other operations efficiently handled by a Wallace tree
- Subtract
- Multiply
- Multiplying n terms of k bits each requires O(lognlogk) cycles.
21 Comparison
22 Benchmark
- DCT
- 16-tap FIR filter on a 32-element array
- Matrix multiply: 12x6 and 6x4 integer matrices
- Sobel edge detection
23 Software Implementation
- The GPP is an IBM PowerPC 405 at 250 MHz, implemented on the FPGA fabric.
- SW: compiled using GCC 4.2.0 at optimization level O3; the inner loop is completely unrolled.
- Instructions counted using PSIM.
24 Hardware Specifications
- FPGA: Virtex II Pro XC2Vp30-7ff896
- VHDL generated using Xilinx ISE version 8.2.022
- Synthesis tool: XST
- Simulator: ModelSim-SE
25 Results
26 Area
- Virtex 4 -XC4VLX200 contains 89,088
27 Benefits
- It removes dependencies from the part of the program to which it is applied.
- On RC (reconfigurable computing), it gives a more efficient implementation than other transformations, e.g. for addition, multiplication, subtraction, etc.
- A single technique exploits more parallelism, without having to select and schedule a wide range of other transformations.
- It can work with non-perfectly nested and un-normalized loops.
28 Limitations
- Like complete loop unrolling, it needs to know the loop bounds.
- Some statements may grow exponentially.
- It is more suitable for code with no control dependency or only limited control dependency.
29 Future work
- Limiting the area
- Pipelining
- Partial expansion
- Handling loops with more extensive control dependency
30 Questions
31 Problems?
- A(n, m) = A(n-1, m) + A(n, m-1)
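One way to read this recurrence as a problem case is to count how many terms its full expansion produces. The sketch below is an interpretation, not something the slide states: `expanded_terms` is a hypothetical name, the operator in the recurrence is assumed to be addition, and row 1 and column 1 of A are assumed to be inputs that count as one term each.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def expanded_terms(n, m):
    """Number of input terms in the full expansion of A(n, m) = A(n-1, m) + A(n, m-1)."""
    if n == 1 or m == 1:
        return 1   # assumed boundary: row 1 and column 1 are inputs
    return expanded_terms(n - 1, m) + expanded_terms(n, m - 1)


if __name__ == "__main__":
    for size in (3, 5, 10, 15):
        print(size, expanded_terms(size, size), comb(2 * size - 2, size - 1))
```

Under these assumptions the count equals the binomial coefficient C(n+m-2, n-1), which grows exponentially along the diagonal n = m, in contrast with the linear growth T(i) = i of the motivational example.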