Recursive Variable Expansion: A Loop Transformation for Reconfigurable Systems


Title: Recursive Variable Expansion: A Loop Transformation for Reconfigurable Systems


1
Recursive Variable Expansion: A Loop Transformation for Reconfigurable Systems
  • Zubair Nawaz

2
Outline
  • Introduction
  • General Loop Transformations
  • Motivational Example
  • Goals for the new Technique
  • Proposed Transformation
  • Suitable Benchmarks
  • Future work

3
Introduction
  • Loop Parallelization
  • Why?
  • Problem
  • Data dependencies
  • Loop Transformation

4
General Loop Transformations
  • Derived for Parallel Computers
  • Commonly used loop-level optimizations for Reconfigurable Computing:
  • Loop Unrolling
  • Loop Tiling
  • Loop Fusion
  • Loop Peeling
  • Loop Fission
  • Loop Reversal
  • Loop Interchanging
  • Loop Skewing
  • Software Pipelining

5
When to apply loop transformation?
  • The most difficult question to answer.
  • The selection and ordering of suitable techniques depend upon the type of loop nest and the dependency relationships.

6
Motivational Example
  • For i = 2 to n
  •   For j = 2 to m
  •     S1: A[i, j] = A[i-1, j] + c
  •     S2: B[i, j] = A[i, j] + A[i, j-1]
  •   End For
  • End For

7
Data Dependency
  • For i = 2 to n
  •   For j = 2 to m
  •     S1: A[i, j] = A[i-1, j] + c
  •     S2: B[i, j] = A[i, j] + A[i, j-1]
  •   End For
  • End For

Iteration Space showing dependencies among the
iterations
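Reading off the statements above, the dependences are: S1 to S1 through A[i-1, j] with distance (1, 0), S1 to S2 through A[i, j] within the same iteration, and S1 to S2 through A[i, j-1] with distance (0, 1).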
8
Loop Skewing
  • For I = 2 to m+n
  •   For J = max(2, I-n) to min(m, I)
  •     S1: A[I-J, J] = A[I-J-1, J] + c
  •     S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  •   End For
  • End For
  • Inner loop is parallelizable (see the C sketch below)
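A minimal C sketch of this wavefront execution, not taken from the slides: N, M and C are assumed placeholder values, and the bounds are written for the skew I = i + j (so I runs from 4 to N+M and J up to min(M, I-2)) to keep every index inside the original iteration space.

  #include <stdio.h>

  #define N 8            /* assumed loop bounds, for illustration only */
  #define M 8
  #define C 1.0f         /* the constant c from the slides */

  static float A[N + 1][M + 1];
  static float B[N + 1][M + 1];

  static int imax(int a, int b) { return a > b ? a : b; }
  static int imin(int a, int b) { return a < b ? a : b; }

  int main(void)
  {
      for (int I = 4; I <= N + M; I++) {                     /* walk the anti-diagonals */
          for (int J = imax(2, I - N); J <= imin(M, I - 2); J++) {
              /* every iteration of this inner loop is independent */
              A[I - J][J] = A[I - J - 1][J] + C;             /* S1 */
              B[I - J][J] = A[I - J][J] + A[I - J][J - 1];   /* S2 */
          }
      }
      printf("%g\n", B[N][M]);   /* zero-initialized inputs give (N-1)*C + (N-1)*C = 14 */
      return 0;
  }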

9
Speed and Area estimation
For I = 2 to m+n
  For J = max(2, I-n) to min(m, I)
    S1: A[I-J, J] = A[I-J-1, J] + c
    S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  End For
End For
  • Assumption
  • Each addition takes one cycle.
  • Area is directly proportional to the number of
    terms to be added.
  • Time: If the whole inner loop is executed in parallel, then the time for each iteration of the outer I loop is 2 cycles.
  • Total time = 2(m+n-1) cycles

10
Speed and Area estimation
For I = 2 to m+n
  For J = max(2, I-n) to min(m, I)
    S1: A[I-J, J] = A[I-J-1, J] + c
    S2: B[I-J, J] = A[I-J, J] + A[I-J, J-1]
  End For
End For
  • Area: the number of terms to be added in each inner iteration is 4.
  • Total area = 4 x the maximum number of iterations of the inner loop over all values of the outer loop index.
  • The maximum number of terms to be added is therefore 4(min(m, n+2) - 1).
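  • As a small worked instance (assuming m = n = 8): the skewed nest takes 2(8 + 8 - 1) = 30 cycles, and the widest wavefront has min(8, 8 + 2) - 1 = 7 inner iterations, i.e. about 4 x 7 = 28 addition terms of hardware.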

11
Goals for the new Technique
  • How can a part of the code be transformed to extract the maximum amount of parallelism and speed?
  • What is the maximum speedup one can achieve for a program, given abundant resources?
  • How can the reconfigurable architecture be used more efficiently?

12
Proposed Transformation
  • Assumption: resources are abundant.
  • Parallelism is constrained by data dependencies.
  • Is it possible to transform the program such that the resulting program is free of data dependencies?
  • That would imply we can get maximum parallelism.
  • Yes

13
Recursive Variable Expansion
  • Basic Idea
  • If there are two statements Si and Tj, for some iterations i and j, such that Si δ Tj (Tj depends on Si), then the computation done in Si can be replicated in Tj, making Tj independent of Si. Repeat this until Tj is a function only of the inputs or known values.
  • As in the example:
  • A[4, 3] = A[3, 3] + c
  •         = A[2, 3] + c + c
  •         = A[1, 3] + c + c + c
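A minimal C sketch of the expanded loop (not from the slides; N, M and C are assumed placeholder values): after substituting S1 into itself repeatedly, every statement reads only the input row A[1][*], the input column A[*][1] and the constant c, so no iteration depends on any other.

  #include <stdio.h>

  #define N 8            /* assumed bounds for illustration */
  #define M 8
  #define C 1.0f

  static float A[N + 1][M + 1];
  static float B[N + 1][M + 1];

  int main(void)
  {
      for (int i = 2; i <= N; i++) {
          for (int j = 2; j <= M; j++) {
              /* S1 fully expanded: A[i][j] = A[1][j] + (i-1)*c */
              float a_ij = A[1][j] + (i - 1) * C;
              /* A[i][j-1] expands the same way, except that column 1 is an input */
              float a_ijm1 = (j == 2) ? A[i][1] : A[1][j - 1] + (i - 1) * C;
              A[i][j] = a_ij;              /* expanded S1 */
              B[i][j] = a_ij + a_ijm1;     /* expanded S2 */
          }
      }
      printf("%g\n", B[N][M]);             /* same result as the skewed version: 14 */
      return 0;
  }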

14
Speed and Area Estimate
For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For
  • Area
  • Statement S1 depends only on i. Let the number of addition terms be denoted by T(i).
  • T(i) = T(i-1) + 1
  • Solution: T(i) = i
  • Suppose the number of addition terms in S2 is denoted by S(i); then S(i) = 2T(i) = 2i.
  • The number of addition terms in one iteration is denoted by t(i) = S(i) + T(i) = 3i.
  • The total number of addition terms over all iterations is the sum of 3i over i = 2..n and j = 2..m, i.e. 3(m-1)(n(n+1)/2 - 1).
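A throwaway C check of that closed form (not part of the slides; n and m are arbitrary test values):

  #include <stdio.h>

  int main(void)
  {
      int n = 8, m = 8;                      /* arbitrary test values */
      long direct = 0;
      for (int i = 2; i <= n; i++)
          for (int j = 2; j <= m; j++)
              direct += 3 * i;               /* t(i) = 3i terms per iteration */
      long closed = 3L * (m - 1) * ((long)n * (n + 1) / 2 - 1);
      printf("direct = %ld, closed form = %ld\n", direct, closed);   /* both 735 */
      return 0;
  }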

15
Speed and Area Computation
For i = 2 to n
  For j = 2 to m
    S1: A[i, j] = A[i-1, j] + c
    S2: B[i, j] = A[i, j] + A[i, j-1]
  End For
End For
  • Is this area always going to be this high?
  • No; if the only output of the loop nest (the only variable used later) is B[n, m],
  • then there is no need to expand all the intermediate computations on the FPGA: B[n, m] can be computed directly after the expansion, as it is only a function of the inputs. In this case the number of addition terms is S(n) = 2n.
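  • For instance, following the expansion above, B[n, m] = A[n, m] + A[n, m-1] = (A[1, m] + (n-1)c) + (A[1, m-1] + (n-1)c), a sum of 2n terms built only from inputs.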

16
Speed and Area Estimate
  • Speed: Suppose we want to add n terms. We can add all consecutive pairs of terms, then repeat the same on the partial sums until only one is left, which takes about log2(n) levels.
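A minimal C sketch of that pairwise reduction (the term array and its length are illustrative); each pass adds independent pairs, so only about log2(n) dependent levels remain, which is what a balanced adder tree in hardware would take.

  #include <stdio.h>

  /* Sum n terms by repeatedly adding adjacent pairs in place. */
  float tree_sum(float *terms, int n)
  {
      while (n > 1) {
          int half = (n + 1) / 2;
          for (int i = 0; i < n / 2; i++)       /* all pairs in a level are independent */
              terms[i] = terms[2 * i] + terms[2 * i + 1];
          if (n % 2)                            /* odd element carried to the next level */
              terms[n / 2] = terms[n - 1];
          n = half;
      }
      return terms[0];
  }

  int main(void)
  {
      float t[8] = {1, 2, 3, 4, 5, 6, 7, 8};
      printf("%g\n", tree_sum(t, 8));           /* prints 36 */
      return 0;
  }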

17
Speed and Area Estimate
  • Can we exploit the flexibility of FPGA to improve
    this design ?
  • Wallace tree

CSA Adder for 3 inputs of 4 bit each
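For reference, the 3:2 carry-save step that the figure and the Wallace tree are built from can be sketched in C as follows (operand values are arbitrary); it compresses three operands into a sum word and a carry word with no carry propagation.

  #include <stdio.h>

  /* One carry-save (3:2 compressor) level: three operands in,
   * a sum word and a shifted carry word out. */
  static void csa(unsigned a, unsigned b, unsigned c,
                  unsigned *sum, unsigned *carry)
  {
      *sum   = a ^ b ^ c;                           /* bitwise sum without carries */
      *carry = ((a & b) | (b & c) | (a & c)) << 1;  /* carries shifted left */
  }

  int main(void)
  {
      unsigned s, cy;
      csa(9, 5, 3, &s, &cy);                        /* 9 + 5 + 3 = 17 */
      printf("%u\n", s + cy);                       /* a final CPA adds the two words: 17 */
      return 0;
  }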
18
Speed and Area Estimation
Wallace Tree for nine inputs
19
Speed and Area Estimation
  • The number of levels required by a Wallace tree to add n inputs is log1.5(n/2).
  • We can use a fast CPA, which takes log k levels for a k-bit addition.
  • As all the computation is done in parallel, the time taken equals the time taken by the term with the maximum number of additions, which is S(n) = 2n.
  • If we assume that one cycle = 8 gate delays,
  • then time = (log1.5(n) + log(k)) / 8 cycles.

20
Other operations handled efficiently by the Wallace tree
  • Subtraction
  • Multiplication
  • Multiplying n terms of k bits each requires O(log n log k) cycles.

21
Comparison
22
Benchmark
  • DCT
  • 16-tap FIR filter on a 32-element array
  • Matrix multiply: 12x6 and 6x4 integer matrices
  • Sobel edge detection

23
Software Implementation
  • GPP: IBM PowerPC 405 at 250 MHz, implemented on the FPGA fabric.
  • SW: compiled using GCC 4.2.0 at optimization level -O3, with the inner loop completely unrolled.
  • Instructions counted using PSIM.

24
Hardware Specifications
  • FPGA: Virtex-II Pro XC2VP30-7ff896
  • VHDL generated using Xilinx ISE version 8.2.022
  • Synthesis tool: XST
  • Simulator: ModelSim SE

25
Results
26
Area
  • Virtex-4 XC4VLX200 contains 89,088 slices

27
Benefits
  • It removes dependencies from the part of the program to which it is applied.
  • On reconfigurable computers, operations such as addition, multiplication and subtraction are implemented more efficiently than with other transformations.
  • A single technique exploits more parallelism, without requiring a careful selection and scheduling of other transformations.
  • It can work with non-perfectly nested and un-normalized loops.

28
Limitations
  • Like complete loop unrolling, it needs to know the bounds of the loops.
  • Some statements may grow exponentially.
  • It is more suitable for code with no, or only limited, control dependency.

29
Future work
  • Limit the area
  • Pipelining
  • Partial expansion
  • Work with loops that have more extensive control dependency.

30
Questions
31
Problems?
  • A(n, m) = A(n-1, m) + A(n, m-1)
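  • Expanding this recurrence with RVE duplicates work: A(3, 3) = A(2, 3) + A(3, 2) = A(1, 3) + A(2, 2) + A(2, 2) + A(3, 1) = ..., so the number of terms roughly doubles with every substitution and the fully expanded statement grows exponentially, which is the limitation noted earlier.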

32
(No Transcript)