1
Loop Transformations
  • Motivation
  • Loop-level transformation catalogue
  • Loop merging
  • Loop interchange
  • Loop unrolling
  • Unroll-and-Jam
  • Loop tiling
  • Loop Transformation Theory and Dependence Analysis

Thanks for many slides go to the DTSE people from
IMEC and Dr. Peter Knijnenburg ( 2007) from
Leiden University
2
Loop Transformations
  • Change the order in which the iteration space is
    traversed.
  • Can expose parallelism, increase available ILP,
    or improve memory behavior.
  • Dependence testing is required to check validity
    of transformation.

3
Why loop transformations?
  • Example 1: in-place mapping

for (j=1; j<M; j++)
  for (i=0; i<N; i++)
    A[i][j] = f(A[i][j-1]);
for (i=0; i<N; i++)
  OUT[i] = A[i][M-1];
=>
for (i=0; i<N; i++) {
  for (j=1; j<M; j++)
    A[i][j] = f(A[i][j-1]);
  OUT[i] = A[i][M-1];
}
4
Why loop transformations?
  • Example 2: memory allocation

for (i=0; i<N; i++) B[i] = f(A[i]);
for (i=0; i<N; i++) C[i] = g(B[i]);
=>
for (i=0; i<N; i++) {
  B[i] = f(A[i]);
  C[i] = g(B[i]);
}
[Figure annotations: N cyc., 2N cyc., N cyc.; 2 background ports; 1 background + 1 foreground port]
5
Loop transformation catalogue
  • merge (fusion)
  • improve locality
  • bump
  • extend/reduce
  • body split
  • reverse
  • improve regularity
  • interchange
  • improve locality/regularity
  • skew
  • tiling
  • index split
  • unroll
  • unroll-and-jam

6
Loop Merge
for Ia = exp1 to exp2
  A(Ia)
for Ib = exp1 to exp2
  B(Ib)
=>
for I = exp1 to exp2
  A(I)
  B(I)
  • Improve locality
  • Reduce loop overhead

7
Loop Merge: Example of locality improvement
for (i=0; i<N; i++) B[i] = f(A[i]);
for (j=0; j<N; j++) C[j] = f(B[j], A[j]);
=>
for (i=0; i<N; i++) {
  B[i] = f(A[i]);
  C[i] = f(B[i], A[i]);
}
  • Consumptions of second loop closer to
    productions and consumptions of first loop
  • Is this always the case after merging loops?

8
Loop Merge: not always allowed!
  • Data dependencies from first to second loop can
    block Loop Merge
  • Merging is allowed if, for all I, cons(I) in loop 2 <=
    prod(I) in loop 1
  • Enablers: Bump, Reverse, Skew

for (i=0; i<N; i++) B[i] = f(A[i]);
for (i=0; i<N; i++) C[i] = g(B[N-1]);
N-1 >= i => dependence blocks merging

for (i=0; i<N; i++) B[i] = f(A[i]);
for (i=2; i<N; i++) C[i] = g(B[i-2]);
i-2 < i => dependence does not block merging
9
Loop Bump
for I = exp1 to exp2
  A(I)
=>
for I = exp1+N to exp2+N
  A(I-N)
10
Loop Bump: Example as enabler
for (i=2; i<N; i++) B[i] = f(A[i]);
for (i=0; i<N-2; i++) C[i] = g(B[i+2]);
i+2 > i => merging not possible
=> Loop Bump
for (i=2; i<N; i++) B[i] = f(A[i]);
for (i=2; i<N; i++) C[i-2] = g(B[i+2-2]);
i+2-2 = i => merging possible
=> Loop Merge
for (i=2; i<N; i++) {
  B[i] = f(A[i]);
  C[i-2] = g(B[i]);
}
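The bump-then-merge sequence on this slide can be checked with a small C sketch. The bodies of `f` and `g`, the size `N`, and all function names are illustrative assumptions, not taken from the slides; the point is only that both versions should fill `B` and `C` identically.

```c
#include <string.h>

#define N 16  /* illustrative size, not from the slides */

static int f(int x) { return 2 * x + 1; }  /* stand-in for the slide's f */
static int g(int x) { return x - 3; }      /* stand-in for the slide's g */

/* Original form: producer loop, then consumer loop reading B[i+2]. */
static void separate_loops(const int A[N], int B[N], int C[N]) {
    for (int i = 2; i < N; i++) B[i] = f(A[i]);
    for (int i = 0; i < N - 2; i++) C[i] = g(B[i + 2]);
}

/* After Loop Bump (consumer index shifted by 2) and Loop Merge. */
static void bumped_and_merged(const int A[N], int B[N], int C[N]) {
    for (int i = 2; i < N; i++) {
        B[i] = f(A[i]);
        C[i - 2] = g(B[i]);  /* B[i] was produced in this same iteration */
    }
}

/* Returns 1 if both versions produce the same B and C. */
static int bump_merge_equivalent(void) {
    int A[N], B1[N] = {0}, C1[N] = {0}, B2[N] = {0}, C2[N] = {0};
    for (int i = 0; i < N; i++) A[i] = i * i;
    separate_loops(A, B1, C1);
    bumped_and_merged(A, B2, C2);
    return memcmp(B1, B2, sizeof B1) == 0 && memcmp(C1, C2, sizeof C1) == 0;
}
```

After merging, each `B[i]` is consumed right after it is produced, so it could in principle stay in a register instead of going through memory.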
11
Loop Extend
for I = exp1 to exp2
  A(I)
with exp3 <= exp1 and exp4 >= exp2
=>
for I = exp3 to exp4
  if I >= exp1 and I <= exp2
    A(I)
12
Loop Extend: Example as enabler
for (i=0; i<N; i++) B[i] = f(A[i]);
for (i=2; i<N+2; i++) C[i-2] = g(B[i]);
=> Loop Extend
for (i=0; i<N+2; i++)
  if (i<N) B[i] = f(A[i]);
for (i=0; i<N+2; i++)
  if (i>=2) C[i-2] = g(B[i]);
=> Loop Merge
for (i=0; i<N+2; i++) {
  if (i<N) B[i] = f(A[i]);
  if (i>=2) C[i-2] = g(B[i]);
}
13
Loop Reduce
for I = exp1 to exp2
  if I >= exp3 and I <= exp4
    A(I)
=>
for I = max(exp1,exp3) to min(exp2,exp4)
  A(I)
14
Loop Body Split
A(I) must be single-assignment: its elements
should be written once
for I = exp1 to exp2
  A(I)
  B(I)
=>
for Ia = exp1 to exp2
  A(Ia)
for Ib = exp1 to exp2
  B(Ib)
15
Loop Body Split: Example as enabler
for (i=0; i<=N; i++) {
  A[i] = f(A[i-1]);
  B[i] = g(in[i]);
}
for (j=0; j<=N; j++) C[j] = h(B[j], A[N]);
=> Loop Body Split
for (i=0; i<=N; i++) A[i] = f(A[i-1]);
for (k=0; k<=N; k++) B[k] = g(in[k]);
for (j=0; j<=N; j++) C[j] = h(B[j], A[N]);
=> Loop Merge
for (i=0; i<=N; i++) A[i] = f(A[i-1]);
for (j=0; j<=N; j++) {
  B[j] = g(in[j]);
  C[j] = h(B[j], A[N]);
}
16
Loop Reverse
for I = exp1 to exp2
  A(I)
=>
for I = exp2 downto exp1
  A(I)
OR
for I = exp1 to exp2
  A(exp2-(I-exp1))
17
Loop Reverse: Satisfy dependencies
A[0] = ...;
for (i=1; i<=N; i++) A[i] = f(A[i-1]);
  • No loop-carried dependencies allowed!

Enabler: data-flow transformations (+ is
associative)
A[0] = ...;
for (i=1; i<=N; i++) A[i] = A[i-1] + f(...);
=>
A[N] = ...;
for (i=N-1; i>=0; i--) A[i] = A[i+1] + f(...);
18
Loop Reverse: Example as enabler
for (i=0; i<=N; i++) B[i] = f(A[i]);
for (i=0; i<=N; i++) C[i] = g(B[N-i]);
N-i > i => merging not possible
=> Loop Reverse
for (i=0; i<=N; i++) B[i] = f(A[i]);
for (i=0; i<=N; i++) C[N-i] = g(B[N-(N-i)]);
N-(N-i) = i => merging possible
=> Loop Merge
for (i=0; i<=N; i++) {
  B[i] = f(A[i]);
  C[N-i] = g(B[i]);
}
19
Loop Interchange
for I1 = exp1 to exp2
  for I2 = exp3 to exp4
    A(I1, I2)
=>
for I2 = exp3 to exp4
  for I1 = exp1 to exp2
    A(I1, I2)
20
Loop Interchange: index traversal
[Figure: traversal of the (i, j) index space before and after Loop Interchange]
for(i=0; i<W; i++)
  for(j=0; j<H; j++)
    A[i][j] = ...;
=>
for(j=0; j<H; j++)
  for(i=0; i<W; i++)
    A[i][j] = ...;
21
Loop Interchange
  • Validity: check the dependence direction vectors.
  • Mostly used to improve cache behavior.
  • The innermost loop (loop index changes fastest)
    should (only) index the right-most array index
    expression in case of row-major storage, like in
    C.
  • Can improve execution time by 1 or 2 orders of
    magnitude.
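As a minimal illustration of the row-major rule above, the following C sketch sums a 2-D array in both loop orders. The array sizes and function names are assumptions made for this example; it only demonstrates that the interchange leaves the result unchanged, and makes no claim about the speed-up on any particular machine.

```c
#include <stddef.h>

#define ROWS 64  /* illustrative sizes, not from the slides */
#define COLS 48

/* Inner loop walks the right-most index: consecutive addresses in row-major C. */
static long sum_row_major(int a[ROWS][COLS]) {
    long s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Interchanged loops: each inner step jumps COLS elements ahead in memory. */
static long sum_column_order(int a[ROWS][COLS]) {
    long s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}

/* Returns 1 if both traversal orders compute the same sum. */
static int interchange_preserves_sum(void) {
    static int a[ROWS][COLS];
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] = (int)(i * COLS + j);
    return sum_row_major(a) == sum_column_order(a);
}
```

The loop has no loop-carried dependence other than the reduction into `s`, so the interchange is valid here; timing the two versions on large arrays is the usual way to observe the cache effect.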

22
Loop Interchange (contd)
  • Loop interchange can also expose parallelism.
  • If an inner loop does not carry a dependence
    (entry in direction vector equals "="), this loop
    can be executed in parallel.
  • Moving this loop outwards increases the
    granularity of the parallel loop iterations.

23
Loop Interchange: Example of locality improvement
for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    B[i][j] = f(A[j], B[i][j-1]);
=>
for (j=0; j<M; j++)
  for (i=0; i<N; i++)
    B[i][j] = f(A[j], B[i][j-1]);
  • In second loop:
  • A[j] reused N times (temporal locality)
  • However, losing spatial locality in B (assuming
    row-major ordering)
  • => Exploration required

24
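Loop Interchange: Satisfy dependencies
  • No loop-carried dependencies allowed unless, for all
    I2, prod(I2) <= cons(I2)

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i][j] = f(A[i-1][j+1]);
direction vector (<,>) => interchange not allowed

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i][j] = f(A[i-1][j]);
direction vector (<,=) => interchange allowed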
  • Enablers
  • Data-flow transformations
  • Loop Bump

25
Loop Skew: Basic transformation
for I1 = exp1 to exp2
  for I2 = exp3 to exp4
    A(I1, I2)
=>
for I1 = exp1 to exp2
  for I2 = exp3+α.I1+β to exp4+α.I1+β
    A(I1, I2-α.I1-β)
26
Loop Skew
for(j=0; j<H; j++)
  for(i=0; i<W; i++)
    A[i][j] = ...;
=> Loop Skewing
for(j=0; j<H; j++)
  for(i=0+j; i<W+j; i++)
    A[i-j][j] = ...;
27
Loop Skew: Example as enabler of regularity
improvement
for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    f(A[i+j]);
=> Loop Skew
for (i=0; i<N; i++)
  for (j=i; j<i+M; j++)
    f(A[j]);
=>
for (j=0; j<N+M; j++)
  for (i=0; i<N; i++)
    if (j>=i && j<i+M) f(A[j]);
28
Loop Tiling
for I = 0 to exp1 . exp2
  A(I)
=> (tile size: exp1, tile factor: exp2)
for I1 = 0 to exp2
  for I2 = exp1.I1 to exp1.(I1+1)
    A(I2)
29
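Loop Tiling
[Figure: traversal of the index space over i before, and over (i, j) after, Loop Tiling]
for(i=0; i<9; i++)
  A[i] = ...;
=>
for(j=0; j<3; j++)
  for(i=4*j; i<4*j+4; i++)
    if (i<9) A[i] = ...;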
30
Loop Tiling
  • Improve cache reuse by dividing the iteration
    space into tiles and iterating over these tiles.
  • Only useful when the working set does not fit into
    cache or when there is much cache interference.
  • Two adjacent loops can legally be tiled if they
    can legally be interchanged.

31
2-D Tiling Example
for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    A[i][j] = B[j][i];
=>
for(TI=0; TI<N; TI+=16)
  for(TJ=0; TJ<N; TJ+=16)
    for(i=TI; i<min(TI+16,N); i++)
      for(j=TJ; j<min(TJ+16,N); j++)
        A[i][j] = B[j][i];
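The tiled transpose on this slide can be written as a runnable C sketch. The problem size `N`, the helper `min_int`, and the function names are assumptions for illustration; the tile size 16 is the one used on the slide. `N` is deliberately not a multiple of 16 so that the `min` bounds are exercised.

```c
#include <string.h>

#define N 50     /* illustrative size, deliberately not a multiple of TILE */
#define TILE 16  /* tile size used on the slide */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled version: A[i][j] = B[j][i]. */
static void transpose_plain(int A[N][N], int B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[j][i];
}

/* Tiled version: outer loops pick a TILE x TILE block, inner loops sweep it. */
static void transpose_tiled(int A[N][N], int B[N][N]) {
    for (int ti = 0; ti < N; ti += TILE)
        for (int tj = 0; tj < N; tj += TILE)
            for (int i = ti; i < min_int(ti + TILE, N); i++)
                for (int j = tj; j < min_int(tj + TILE, N); j++)
                    A[i][j] = B[j][i];
}

/* Returns 1 if the tiled and untiled versions agree. */
static int tiling_equivalent(void) {
    static int B[N][N], A1[N][N], A2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] = i * N + j;
    transpose_plain(A1, B);
    transpose_tiled(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

Within one tile, both the rows of `A` and the columns of `B` that are touched stay within a small block, which is what gives the cache a chance to reuse lines of `B`.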
32
2-D Tiling
  • Show index space traversal

33
What is the best Tile Size?
  • Current tile size selection algorithms use a
    cache model
  • Generate collection of tile sizes
  • Estimate resulting cache miss rate
  • Select best one.
  • Only take into account L1 cache.
  • Mostly do not take into account n-way
    associativity.

34
Loop Index Split
for Ia = exp1 to exp2
  A(Ia)
=>
for Ia = exp1 to p
  A(Ia)
for Ib = p+1 to exp2
  A(Ib)
35
Loop Unrolling
  • Duplicate loop body and adjust loop header.
  • Increases available ILP, reduces loop overhead,
    and increases possibilities for common
    subexpression elimination.
  • Always valid!

36
(Partial) Loop Unrolling
for I = exp1 to exp2
  A(I)
=> (full unrolling)
A(exp1)
A(exp1+1)
A(exp1+2)
...
A(exp2)
=> (partial unrolling by 2)
for I = exp1/2 to exp2/2
  A(2I)
  A(2I+1)
37
Loop Unrolling Downside
  • If unroll factor is not divisor of trip count,
    then need to add remainder loop.
  • If trip count not known at compile time, need to
    check at runtime.
  • Code size increases which may result in higher
    I-cache miss rate.
  • Global determination of optimal unroll factors is
    difficult.
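The remainder-loop issue from the first two bullets can be sketched in C as follows. The unroll factor 4 and all function names are assumptions chosen for this example; the trip count `n` is only known at run time, so the split between the unrolled part and the remainder is computed there.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration. */
static long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Unrolled by 4, with a remainder loop for the last n % 4 elements. */
static long sum_unrolled4(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    size_t main_end = n - (n % 4);  /* largest multiple of 4 not exceeding n */
    for (; i < main_end; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)              /* remainder loop */
        s += a[i];
    return s;
}

/* Returns 1 if both versions agree on a test array of n elements (n <= 64). */
static int unroll_matches(size_t n) {
    int a[64];
    assert(n <= 64);
    for (size_t i = 0; i < n; i++) a[i] = (int)(3 * i + 1);
    return sum_rolled(a, n) == sum_unrolled4(a, n);
}
```

Calling `unroll_matches` with trip counts such as 0, 7, and 8 exercises the empty case, a non-empty remainder, and an exact multiple of the unroll factor.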

38
Loop Unroll: Example of removal of non-affine
iterator
for (L=1; L<4; L++)
  for (i=0; i<(1<<L); i++)
    A(L,i);
=>
for (i=0; i<2; i++) A(1,i);
for (i=0; i<4; i++) A(2,i);
for (i=0; i<8; i++) A(3,i);
39
Unroll-and-Jam
  • Unroll outer loop and fuse new copies of the
    inner loop.
  • Increases size of loop body and hence available
    ILP.
  • Can also improve locality.

40
Unroll-and-Jam Example
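for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    A[i][j] = B[j][i];
=>
for (i=0; i<N; i+=2) {
  for (j=0; j<N; j++)
    A[i][j] = B[j][i];
  for (j=0; j<N; j++)
    A[i+1][j] = B[j][i+1];
}
=>
for (i=0; i<N; i+=2)
  for (j=0; j<N; j++) {
    A[i][j] = B[j][i];
    A[i+1][j] = B[j][i+1];
  }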
  • More ILP exposed
  • Spatial locality of B enhanced
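A runnable C sketch of this unroll-and-jam example follows. The slide's code assumes N is even; the clean-up loop for odd N, the size `N`, and the function names are additions made for this illustration.

```c
#include <string.h>

#define N 9  /* odd on purpose, to exercise the clean-up row */

/* Original loop nest. */
static void copy_transposed(int A[N][N], int B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[j][i];
}

/* Outer loop unrolled by 2, the two inner-loop copies fused (jammed). */
static void copy_transposed_uaj(int A[N][N], int B[N][N]) {
    int i = 0;
    for (; i + 1 < N; i += 2)
        for (int j = 0; j < N; j++) {
            A[i][j] = B[j][i];
            A[i + 1][j] = B[j][i + 1];  /* neighbour of B[j][i] in memory */
        }
    for (; i < N; i++)                  /* leftover row when N is odd */
        for (int j = 0; j < N; j++)
            A[i][j] = B[j][i];
}

/* Returns 1 if both versions produce the same A. */
static int unroll_and_jam_equivalent(void) {
    static int B[N][N], A1[N][N], A2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] = 7 * i + j;
    copy_transposed(A1, B);
    copy_transposed_uaj(A2, B);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```

The jammed body reads `B[j][i]` and `B[j][i+1]` back to back, which are adjacent in row-major storage; this is the spatial-locality gain the slide refers to.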

41
Simplified loop transformation script
  • Give all loops same nesting depth
  • Use dummy 1-iteration loops if necessary
  • Improve regularity
  • Usually applying loop interchange or reverse
  • Improve locality
  • Usually applying loop merge
  • Break data dependencies with loop bump/skew
  • Sometimes loop index split or unrolling is easier

42
Loop transformation theory
  • Iteration spaces
  • Polyhedral model
  • Dependency analysis

43
Technical Preliminaries (1)
do i = 2, N
  do j = i, N
    x[i,j] = 0.5*(x[i-1,j] + x[i,j-1])    (1)
  enddo
enddo
  • Address expr (1)
  • perfect loop nest
  • iteration space
  • dependence vector

[Figure: iteration space plotted on axes i and j, ticks 2 to 4]
44
Technical Preliminaries (2)
Switch loop indexes:
do m = 2, N
  do n = 2, m
    x[n,m] = 0.5*(x[n-1,m] + x[n,m-1])    (2)
  enddo
enddo
affine transformation
45
Polyhedral Model
  • A polyhedron is a set $\{x : Ax \le c\}$ for some matrix $A$
    and bounds vector $c$
  • Polyhedra (or Polytopes) are objects in a
    many-dimensional space without holes
  • Iteration spaces of loops (with unit stride) can
    be represented as polyhedra
  • Array accesses and loop transformations can be
    represented as matrices

46
Iteration Space
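  • A loop nest is represented as $B I \le b$ for
    iteration vector $I$
  • Example:

for(i=0; i<10; i++)
  for(j=i; j<10; j++)

$$
\begin{pmatrix} -1 & 0 \\ 1 & 0 \\ 1 & -1 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
\le
\begin{pmatrix} 0 \\ 9 \\ 0 \\ 9 \end{pmatrix}
$$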
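To make the matrix form of this example concrete, here is a small C sketch that checks the four inequalities (i >= 0, i <= 9, j >= i, j <= 9) point by point and compares the result with the loop nest itself. The names `Bmat`, `bvec`, and the helper functions are illustrative assumptions.

```c
/* B I <= b for the example: rows encode -i <= 0, i <= 9, i - j <= 0, j <= 9. */
static const int Bmat[4][2] = {{-1, 0}, {1, 0}, {1, -1}, {0, 1}};
static const int bvec[4] = {0, 9, 0, 9};

/* Returns 1 if the integer point (i, j) satisfies every row of B I <= b. */
static int in_polyhedron(int i, int j) {
    for (int r = 0; r < 4; r++)
        if (Bmat[r][0] * i + Bmat[r][1] * j > bvec[r]) return 0;
    return 1;
}

/* Count the integer points of a surrounding box that lie in the polyhedron. */
static int count_polyhedron_points(void) {
    int count = 0;
    for (int i = -2; i <= 12; i++)
        for (int j = -2; j <= 12; j++)
            count += in_polyhedron(i, j);
    return count;
}

/* Count the iterations executed by the loop nest from the slide. */
static int count_loop_iterations(void) {
    int count = 0;
    for (int i = 0; i < 10; i++)
        for (int j = i; j < 10; j++)
            count++;
    return count;
}
```

Both counts come out to 10 + 9 + ... + 1 = 55, the number of integer points in the triangular iteration space.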
47
Array Accesses
  • Any array access A[e1][e2] for linear index
    expressions e1 and e2 can be represented as an
    access matrix and offset vector.
  • $A I + a$
  • This can be considered as a mapping from the
    iteration space into the storage space of the
    array (which is a trivial polyhedron)

48
Unimodular Matrices
  • A unimodular matrix $T$ is a matrix with integer
    entries and determinant $\pm 1$.
  • This means that such a matrix maps an object onto
    another object with exactly the same number of
    integer points in it.
  • Its inverse $T^{-1}$ always exists and is unimodular as
    well.

49
Types of Unimodular Transformations
  • Loop interchange
  • Loop reversal
  • Loop skewing for arbitrary skew factor
  • Since unimodular transformations are closed under
    multiplication, any combination is a unimodular
    transformation again.
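The closure property can be illustrated with 2x2 matrices in C. The concrete matrices below (interchange, reversal of the inner loop, skewing the inner loop by the outer index with factor 1) and the type and function names are assumptions chosen for this sketch.

```c
typedef struct { int m[2][2]; } Mat2;

/* (i, j) -> (j, i) */
static const Mat2 INTERCHANGE = {{{0, 1}, {1, 0}}};
/* (i, j) -> (i, -j) */
static const Mat2 REVERSE_INNER = {{{1, 0}, {0, -1}}};
/* (i, j) -> (i, i + j) */
static const Mat2 SKEW_INNER_BY_1 = {{{1, 0}, {1, 1}}};

static Mat2 mat_mul(Mat2 a, Mat2 b) {
    Mat2 r;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            r.m[i][j] = a.m[i][0] * b.m[0][j] + a.m[i][1] * b.m[1][j];
    return r;
}

static int det(Mat2 a) {
    return a.m[0][0] * a.m[1][1] - a.m[0][1] * a.m[1][0];
}

/* Integer entries are guaranteed by the type, so only the determinant is checked. */
static int is_unimodular(Mat2 a) {
    int d = det(a);
    return d == 1 || d == -1;
}

/* Skew first, then interchange: the combined transformation matrix. */
static Mat2 skew_then_interchange(void) {
    return mat_mul(INTERCHANGE, SKEW_INNER_BY_1);
}
```

Since the determinant of a product is the product of the determinants, any sequence of these three transformations again has determinant +1 or -1.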

50
Application
  • Transformed loop nest is given by $B T^{-1} I' \le b$
  • Any array access matrix $A$ is transformed into $A T^{-1}$.
  • Transformed loop nest needs to be normalized by
    means of Fourier-Motzkin elimination to ensure
    that loop bounds are affine expressions in more
    outer loop indices.

51
Dependence Analysis
  • Consider following statements
  • S1: a = b + c
  • S2: d = a + f
  • S3: a = g + h
  • S1 → S2: true or flow dependence (RaW)
  • S2 → S3: anti-dependence (WaR)
  • S1 → S3: output dependence (WaW)

52
Dependences in Loops
  • Consider the following loop
for(i=0; i<N; i++) {
  S1: a[i] = ...;
  S2: b[i] = a[i-1];
}
  • Loop-carried dependence S1 → S2.
  • Need to detect if there exist i and i' such that
    i = i'-1 in loop space.

53
Definition of Dependence
  • There exists a dependence if there are two statement
    instances that refer to the same memory location
    and (at least) one of them is a write.
  • There should not be a write between these two
    statement instances.
  • In general, it is undecidable whether there exists
    a dependence.

54
Direction of Dependence
  • If there is a flow dependence between two
    statements S1 and S2 in a loop, then S1 writes to
    a variable in an earlier iteration than S2 reads
    that variable.
  • The iteration vector of the write is
    lexicographically less than the iteration vector
    of the read.
  • $I \prec I'$ iff $i_1 = i'_1 \wedge \dots \wedge i_{k-1} = i'_{k-1} \wedge i_k < i'_k$
    for some $k$.

55
Direction Vectors
  • A direction vector is a vector
  • (=,=,...,=,<,*,*,...,*)
  • where * can denote = or < or >.
  • Such a vector encodes a (collection of)
    dependence(s).
  • A loop transformation should result in a new
    direction vector for the dependence that is also
    lexicographically positive.

56
Loop Interchange
  • Interchanging two loops also interchanges the
    corresponding entries in a direction vector.
  • Example: if the direction vector of a dependence is
    (<,>) then we may not interchange the loops
    because the resulting direction would be (>,<),
    which is lexicographically negative.
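The interchange rule for direction vectors can be sketched in C. The string encoding (one character `<`, `=` or `>` per loop level), the limit of 8 loop levels, and the function names are assumptions for this example; a vector consisting only of `=` entries is treated as legal here because it describes a dependence within a single iteration.

```c
#include <assert.h>

/* Returns 0 if the first non-'=' entry is '>', i.e. the vector is
   lexicographically negative; returns 1 otherwise. */
static int direction_is_legal(const char *dir, int depth) {
    for (int k = 0; k < depth; k++) {
        if (dir[k] == '<') return 1;
        if (dir[k] == '>') return 0;
    }
    return 1;  /* all '=': dependence stays within one iteration */
}

/* Swap entries a and b of the direction vector and test the result. */
static int interchange_is_legal(const char *dir, int depth, int a, int b) {
    char swapped[8];
    assert(depth > 0 && depth <= 8);
    assert(a >= 0 && a < depth && b >= 0 && b < depth);
    for (int k = 0; k < depth; k++) swapped[k] = dir[k];
    swapped[a] = dir[b];
    swapped[b] = dir[a];
    return direction_is_legal(swapped, depth);
}
```

For the slide's example, `interchange_is_legal("<>", 2, 0, 1)` yields 0 because the swapped vector is (>,<), while (<,=) becomes (=,<) and stays legal. A loop nest with several dependences needs this check for every direction vector.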

57
Affine Bounds and Indices
  • We assume loop bounds and array index expressions
    are affine expressions
  • $a_0 + a_1 i_1 + \dots + a_k i_k$
  • Each loop bound for loop index $i_k$ is an affine
    expression over the previous loop indices $i_1$ to
    $i_{k-1}$.
  • Each array index expression is an affine expression
    over all loop indices.

58
Non-Affine Expressions
  • Index expressions like i*j cannot be handled by
    dependence tests. We must assume that there
    exists a dependence.
  • An important class of index expressions are
    indirections A[B[i]]. These occur frequently in
    scientific applications (sparse matrix
    computations).
  • In embedded applications?

59
Linear Diophantine Equations
  • A linear Diophantine equation is of the form
  • $\sum_j a_j x_j = c$
  • The equation has an integer solution iff
    $\gcd(a_1,\dots,a_n)$ is a divisor of $c$

60
GCD Test for Dependence
  • Assume a single loop and two references A[a+b*i] and
    A[c+d*i].
  • If there exists a dependence, then gcd(b,d)
    divides (c-a).
  • Note the direction of the implication!
  • If gcd(b,d) does not divide (c-a) then there
    exists no dependence.

61
GCD Test (contd)
  • However, if gcd(b,d) does divide (c-a) then
    there might exist a dependence.
  • Test is not exact since it does not take into
    account loop bounds.
  • For example, gcd(1,1) = 1 divides 10, yet the loop
    bounds rule out a dependence:

for(i=0; i<10; i++)
  A[i] = A[i+10] + 1;
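The GCD test for two references A[a+b*i] and A[c+d*i'] can be sketched in C, together with a brute-force check that takes the loop bounds into account. The function names and the enumeration-based exact check are assumptions for illustration; a real compiler would not enumerate iterations.

```c
#include <stdlib.h>

static int gcd(int x, int y) {
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test: returns 0 when a dependence is impossible,
   1 when it cannot be ruled out (loop bounds are ignored). */
static int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(b, d);
    if (g == 0) return a == c;  /* both subscripts are constants */
    return (c - a) % g == 0;
}

/* Exact answer by enumeration over 0 <= i, i' < n (tiny loops only). */
static int exact_depends(int a, int b, int c, int d, int n) {
    for (int i = 0; i < n; i++)
        for (int ip = 0; ip < n; ip++)
            if (a + b * i == c + d * ip) return 1;
    return 0;
}
```

For the loop with A[i] and A[i+10], `gcd_test_may_depend(0, 1, 10, 1)` cannot rule out a dependence, while `exact_depends(0, 1, 10, 1, 10)` shows that none exists within the bounds. For A[2*i] and A[2*i+1] the GCD test alone disproves the dependence, since 2 does not divide 1.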

62
GCD Test (contd)
  • Using the Theorem on linear diophantine
    equations, we can test in arbitrary loop nests.
  • We need one test for each direction vector.
  • Vector (=,=,...,=,<,...) implies that the first k
    indices are the same.
  • See book by Zima for details.

63
Other Dependence Tests
  • There exist many dependence tests
  • Separability test
  • GCD test
  • Banerjee test
  • Range test
  • Fourier-Motzkin test
  • Omega test
  • Exactness increases, but so does the cost.

64
Fourier-Motzkin Elimination
  • Consider a collection of linear inequalities over
    the variables $i_1,\dots,i_n$
  • $e_1(i_1,\dots,i_n) \le e'_1(i_1,\dots,i_n)$
  • $\vdots$
  • $e_m(i_1,\dots,i_n) \le e'_m(i_1,\dots,i_n)$
  • Is this system consistent, or does there exist a
    solution?
  • FM-elimination can determine this.

65
FM-Elimination (contd)
  • First, create all pairs $L(i_1,\dots,i_{n-1}) \le i_n$ and
    $i_n \le U(i_1,\dots,i_{n-1})$. This is the solution for $i_n$.
  • Then create a new system
  • $L(i_1,\dots,i_{n-1}) \le U(i_1,\dots,i_{n-1})$
  • together with all original inequalities not
    involving $i_n$.
  • This new system has one variable less and we
    continue this way.

66
FM-Elimination (contd)
  • After eliminating $i_1$, we end up with a collection
    of inequalities between constants $c_1 \le c'_1$.
  • The original system is consistent iff every such
    inequality can be satisfied.
  • That is, there is no inequality like
  • $10 \le 3$.
  • There may be exponentially many new inequalities
    generated!
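The elimination steps described on these slides can be sketched in C for a small system. This is a minimal illustration under stated assumptions: two variables (`NV`), a fixed capacity (`MAXR`), integer coefficients, and inequalities stored in the form sum of coef[k]*x[k] <= rhs. The type `Ineq` and all function names are hypothetical. Like the method itself, the sketch decides whether a real-valued solution exists, not an integer one.

```c
#include <assert.h>

#define NV 2     /* number of variables in this sketch */
#define MAXR 64  /* capacity for generated inequalities */

typedef struct { long coef[NV]; long rhs; } Ineq;  /* sum coef[k]*x[k] <= rhs */

/* Returns 1 if the system has a real solution, 0 if not,
   -1 if more than MAXR inequalities would be generated. */
static int fm_consistent(const Ineq *in, int n) {
    Ineq cur[MAXR], next[MAXR];
    assert(n >= 0 && n <= MAXR);
    for (int r = 0; r < n; r++) cur[r] = in[r];
    for (int v = NV - 1; v >= 0; v--) {      /* eliminate x[v] */
        int m = 0;
        for (int r = 0; r < n; r++)          /* rows not involving x[v] */
            if (cur[r].coef[v] == 0) {
                if (m == MAXR) return -1;
                next[m++] = cur[r];
            }
        for (int p = 0; p < n; p++) {        /* upper bounds on x[v] */
            if (cur[p].coef[v] <= 0) continue;
            for (int q = 0; q < n; q++) {    /* lower bounds on x[v] */
                if (cur[q].coef[v] >= 0) continue;
                if (m == MAXR) return -1;
                long sp = -cur[q].coef[v];   /* positive scale factors that */
                long sq = cur[p].coef[v];    /* cancel the x[v] terms       */
                for (int k = 0; k < NV; k++)
                    next[m].coef[k] = sp * cur[p].coef[k] + sq * cur[q].coef[k];
                next[m].rhs = sp * cur[p].rhs + sq * cur[q].rhs;
                m++;
            }
        }
        for (int r = 0; r < m; r++) cur[r] = next[r];
        n = m;
    }
    for (int r = 0; r < n; r++)              /* only 0 <= rhs remains */
        if (cur[r].rhs < 0) return 0;
    return 1;
}

/* 0 <= i <= 9 and i <= j <= 9: a triangular iteration space. */
static int example_feasible(void) {
    const Ineq s[] = {{{-1, 0}, 0}, {{1, 0}, 9}, {{1, -1}, 0}, {{0, 1}, 9}};
    return fm_consistent(s, 4);
}

/* x >= 10 and x <= 3: reduces to the contradiction 10 <= 3. */
static int example_infeasible(void) {
    const Ineq s[] = {{{-1, 0}, -10}, {{1, 0}, 3}};
    return fm_consistent(s, 2);
}
```

Each elimination step pairs every upper bound with every lower bound, which is where the possible exponential growth in the number of inequalities comes from.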

67
Fourier-Motzkin Test
  • Loop bounds plus array index expressions generate
    sets of inequalities, using new loop indices $i'$
    for the sink of the dependence.
  • Each direction vector generates inequalities
  • $i_1 = i'_1 \wedge \dots \wedge i_{k-1} = i'_{k-1} \wedge i_k < i'_k$
  • If all these systems are inconsistent, then there
    exists no dependence.
  • This test is not exact (real solutions but no
    integer ones) but almost.

68
N-Dimensional Arrays
  • Test in each dimension separately.
  • This can introduce another level of inaccuracy.
  • Some tests (FM and Omega test) can test in many
    dimensions at the same time.
  • Otherwise, you can linearize an array: transform
    a logically N-dimensional array to its
    one-dimensional storage format.

69
Hierarchy of Tests
  • Try cheap test, then more expensive ones
  • if (cheap test1 = NO)
  • then print NO
  • else if (test2 = NO)
  • then print NO
  • else if (test3 = NO)
  • then print NO
  • else ...

70
Practical Dependence Testing
  • Cheap tests, like GCD and Banerjee tests, can
    disprove many dependences.
  • Adding expensive tests only disproves a few more
    possible dependences.
  • The compiler writer needs to trade off compilation
    time against accuracy of dependence testing.
  • For time critical applications, expensive tests
    like Omega test (exact!) can be used.