Title: Loop Transformations
1 Loop Transformations
- Motivation
- Loop-level transformation catalogue
  - Loop merging
  - Loop interchange
  - Loop unrolling
  - Unroll-and-jam
  - Loop tiling
- Loop transformation theory and dependence analysis
Thanks for many slides go to the DTSE people from IMEC and Dr. Peter Knijnenburg (2007) from Leiden University
2 Loop Transformations
- Change the order in which the iteration space is traversed.
- Can expose parallelism, increase available ILP, or improve memory behavior.
- Dependence testing is required to check the validity of a transformation.
3 Why loop transformations?
- Example 1: in-place mapping

  for (j=1; j<M; j++)
    for (i=0; i<N; i++)
      A[i][j] = f(A[i][j-1]);
  for (i=0; i<N; i++)
    OUT[i] = A[i][M-1];

  for (i=0; i<N; i++) {
    for (j=1; j<M; j++)
      A[i][j] = f(A[i][j-1]);
    OUT[i] = A[i][M-1];
  }

  After the transformation, each row A[i][*] is consumed into OUT[i] right after it is produced, so its storage can be reused in place.
4 Why loop transformations?
Example 2: memory allocation

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[i]);
  // N cyc. + N cyc. = 2N cycles; 2 background ports

  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = g(B[i]);
  }
  // N cycles; 1 background + 1 foreground port
5 Loop transformation catalogue
- merge (fusion)
  - improve locality
- bump
- extend/reduce
- body split
- reverse
  - improve regularity
- interchange
  - improve locality/regularity
- skew
- tiling
- index split
- unroll
- unroll-and-jam
6 Loop Merge

  for Ia = exp1 to exp2
    A(Ia)
  for Ib = exp1 to exp2
    B(Ib)

  =>

  for I = exp1 to exp2
    A(I)
    B(I)

- Improve locality
- Reduce loop overhead
7 Loop Merge: Example of locality improvement

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (j=0; j<N; j++)
    C[j] = f(B[j], A[j]);

  =>

  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = f(B[i], A[i]);
  }

- Consumptions of the second loop are closer to the productions and consumptions of the first loop.
- Is this always the case after merging loops?
8 Loop Merge: not always allowed!
- Data dependences from the first to the second loop can block Loop Merge.
- Merging is allowed if, for every iteration I, the element consumed in loop 2 at iteration I was produced in loop 1 at iteration I or earlier.
- Enablers: Bump, Reverse, Skew

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[N-1]);     // N-1 > i: merging blocked

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N; i++)
    C[i] = g(B[i-2]);     // i-2 < i: merging allowed
9 Loop Bump

  for I = exp1 to exp2
    A(I)

  =>

  for I = exp1+N to exp2+N
    A(I-N)
10 Loop Bump: Example as enabler

  for (i=2; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N-2; i++)
    C[i] = g(B[i+2]);

  i+2 > i => merging not possible

  Loop Bump =>

  for (i=2; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N; i++)
    C[i-2] = g(B[i+2-2]);

  i+2-2 = i => merging possible

  Loop Merge =>

  for (i=2; i<N; i++) {
    B[i] = f(A[i]);
    C[i-2] = g(B[i]);
  }
11 Loop Extend

  for I = exp1 to exp2
    A(I)

  with exp3 <= exp1 and exp4 >= exp2

  =>

  for I = exp3 to exp4
    if I >= exp1 and I <= exp2
      A(I)
12 Loop Extend: Example as enabler

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N+2; i++)
    C[i-2] = g(B[i]);

  Loop Extend =>

  for (i=0; i<N+2; i++)
    if (i<N) B[i] = f(A[i]);
  for (i=0; i<N+2; i++)
    if (i>=2) C[i-2] = g(B[i]);

  Loop Merge =>

  for (i=0; i<N+2; i++) {
    if (i<N) B[i] = f(A[i]);
    if (i>=2) C[i-2] = g(B[i]);
  }
13 Loop Reduce

  for I = exp1 to exp2
    if I >= exp3 and I <= exp4
      A(I)

  =>

  for I = max(exp1,exp3) to min(exp2,exp4)
    A(I)
14 Loop Body Split
- A(I) must be single-assignment: its elements should be written only once.

  for I = exp1 to exp2
    A(I)
    B(I)

  =>

  for Ia = exp1 to exp2
    A(Ia)
  for Ib = exp1 to exp2
    B(Ib)
15 Loop Body Split: Example as enabler

  for (i=0; i<N; i++) {
    A[i] = f(A[i-1]);
    B[i] = g(in[i]);
  }
  for (j=0; j<N; j++)
    C[j] = h(B[j], A[N]);

  Loop Body Split =>

  for (i=0; i<N; i++)
    A[i] = f(A[i-1]);
  for (k=0; k<N; k++)
    B[k] = g(in[k]);
  for (j=0; j<N; j++)
    C[j] = h(B[j], A[N]);

  Loop Merge =>

  for (i=0; i<N; i++)
    A[i] = f(A[i-1]);
  for (j=0; j<N; j++) {
    B[j] = g(in[j]);
    C[j] = h(B[j], A[N]);
  }
16 Loop Reverse

  for I = exp1 to exp2
    A(I)

  =>

  for I = exp2 downto exp1
    A(I)

  OR

  for I = exp1 to exp2
    A(exp2-(I-exp1))
17 Loop Reverse: Satisfy dependencies

  A[0] = ...;
  for (i=1; i<N; i++)
    A[i] = f(A[i-1]);

- No loop-carried dependencies allowed!
- Enabler: data-flow transformations (+ is associative)

  A[0] = ...;
  for (i=1; i<N; i++)
    A[i] = A[i-1] + f(...);

  =>

  A[N] = ...;
  for (i=N-1; i>0; i--)
    A[i] = A[i+1] + f(...);
18 Loop Reverse: Example as enabler

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[N-i]);

  N-i > i => merging not possible

  Loop Reverse =>

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[N-i] = g(B[N-(N-i)]);

  N-(N-i) = i => merging possible

  Loop Merge =>

  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[N-i] = g(B[i]);
  }
19 Loop Interchange

  for I1 = exp1 to exp2
    for I2 = exp3 to exp4
      A(I1, I2)

  =>

  for I2 = exp3 to exp4
    for I1 = exp1 to exp2
      A(I1, I2)
20 Loop Interchange: index traversal
(figure: traversal order of the (i,j) index space before and after interchange)

  for (i=0; i<W; i++)
    for (j=0; j<H; j++)
      A[i][j] = ...;

  Loop Interchange =>

  for (j=0; j<H; j++)
    for (i=0; i<W; i++)
      A[i][j] = ...;
21 Loop Interchange
- Validity check: dependence direction vectors.
- Mostly used to improve cache behavior.
- The innermost loop (whose index changes fastest) should (only) index the right-most array dimension in case of row-major storage, as in C.
- Can improve execution time by 1 or 2 orders of magnitude.
22 Loop Interchange (cont'd)
- Loop interchange can also expose parallelism.
- If an inner loop does not carry a dependence (its entry in the direction vector equals '='), this loop can be executed in parallel.
- Moving this loop outwards increases the granularity of the parallel loop iterations.
23 Loop Interchange: Example of locality improvement

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      B[i][j] = f(A[j], B[i][j-1]);

  =>

  for (j=0; j<M; j++)
    for (i=0; i<N; i++)
      B[i][j] = f(A[j], B[i][j-1]);

- In the second version, A[j] is reused N times (temporal locality).
- However, we lose spatial locality in B (assuming row-major ordering).
- => Exploration required.
24 Loop Interchange: Satisfy dependencies
- No loop-carried dependencies allowed, unless every production still precedes its consumption after the interchange.

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      A[i][j] = f(A[i-1][j+1]);   // interchange not allowed

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      A[i][j] = f(A[i-1][j]);     // interchange allowed

- Enablers:
  - Data-flow transformations
  - Loop Bump
25 Loop Skew: Basic transformation

  for I1 = exp1 to exp2
    for I2 = exp3 to exp4
      A(I1, I2)

  =>

  for I1 = exp1 to exp2
    for I2 = exp3+a*I1+b to exp4+a*I1+b
      A(I1, I2-a*I1-b)

  (with skew factor a and offset b)
26 Loop Skew

  for (j=0; j<H; j++)
    for (i=0; i<W; i++)
      A[i][j] = ...;

  Loop Skewing =>

  for (j=0; j<H; j++)
    for (i=0+j; i<W+j; i++)
      A[i-j][j] = ...;
27 Loop Skew: Example as enabler of regularity improvement

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      ... = f(A[i+j]);

  Loop Skew =>

  for (i=0; i<N; i++)
    for (j=i; j<i+M; j++)
      ... = f(A[j]);

  Loop Interchange =>

  for (j=0; j<N+M; j++)
    for (i=0; i<N; i++)
      if (j>=i && j<i+M)
        ... = f(A[j]);
28 Loop Tiling

  for I = 0 to exp1 * exp2
    A(I)

  Tile size: exp1
  Tile factor: exp2

  =>

  for I1 = 0 to exp2
    for I2 = exp1*I1 to exp1*(I1+1)
      A(I2)
29 Loop Tiling

  for (i=0; i<9; i++)
    A[i] = ...;

  Loop Tiling =>

  for (j=0; j<3; j++)
    for (i=4*j; i<4*j+4; i++)
      if (i<9)
        A[i] = ...;
30 Loop Tiling
- Improve cache reuse by dividing the iteration space into tiles and iterating over these tiles.
- Only useful when the working set does not fit into the cache or when there is much interference.
- Two adjacent loops can legally be tiled if they can legally be interchanged.
31 2-D Tiling Example

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];

  =>

  for (TI=0; TI<N; TI+=16)
    for (TJ=0; TJ<N; TJ+=16)
      for (i=TI; i<min(TI+16,N); i++)
        for (j=TJ; j<min(TJ+16,N); j++)
          A[i][j] = B[j][i];
32 2-D Tiling
(figure: index space traversal of the tiled loop nest)
33 What is the best tile size?
- Current tile-size selection algorithms use a cache model:
  - Generate a collection of tile sizes
  - Estimate the resulting cache miss rate
  - Select the best one
- They mostly only take the L1 cache into account.
- They mostly do not take n-way associativity into account.
34 Loop Index Split

  for Ia = exp1 to exp2
    A(Ia)

  =>

  for Ia = exp1 to p
    A(Ia)
  for Ib = p+1 to exp2
    A(Ib)
35 Loop Unrolling
- Duplicate the loop body and adjust the loop header.
- Increases available ILP, reduces loop overhead, and increases the possibilities for common subexpression elimination.
- Always valid!
36 (Partial) Loop Unrolling

  for I = exp1 to exp2
    A(I)

  full unroll =>

  A(exp1); A(exp1+1); A(exp1+2); ... A(exp2)

  unroll factor 2 =>

  for I = exp1/2 to exp2/2
    A(2I)
    A(2I+1)
37 Loop Unrolling: Downside
- If the unroll factor is not a divisor of the trip count, a remainder loop must be added.
- If the trip count is not known at compile time, a runtime check is needed.
- Code size increases, which may result in a higher I-cache miss rate.
- Global determination of optimal unroll factors is difficult.
38 Loop Unroll: Example of removal of non-affine iterator

  for (L=1; L<4; L++)
    for (i=0; i<(1<<L); i++)
      A(L,i);

  =>

  for (i=0; i<2; i++) A(1,i);
  for (i=0; i<4; i++) A(2,i);
  for (i=0; i<8; i++) A(3,i);
39 Unroll-and-Jam
- Unroll the outer loop and fuse the resulting copies of the inner loop.
- Increases the size of the loop body and hence the available ILP.
- Can also improve locality.
40 Unroll-and-Jam Example

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];

  Unroll =>

  for (i=0; i<N; i+=2) {
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];
    for (j=0; j<N; j++)
      A[i+1][j] = B[j][i+1];
  }

  Jam =>

  for (i=0; i<N; i+=2)
    for (j=0; j<N; j++) {
      A[i][j] = B[j][i];
      A[i+1][j] = B[j][i+1];
    }

- More ILP exposed
- Spatial locality of B enhanced
41 Simplified loop transformation script
- Give all loops the same nesting depth
  - Use dummy 1-iteration loops if necessary
- Improve regularity
  - Usually by applying loop interchange or reverse
- Improve locality
  - Usually by applying loop merge
  - Break data dependencies with loop bump/skew
  - Sometimes loop index split or unrolling is easier
42 Loop transformation theory
- Iteration spaces
- Polyhedral model
- Dependence analysis
43 Technical Preliminaries (1)

  do i = 2, N
    do j = i, N
      x(i,j) = 0.5*(x(i-1,j) + x(i,j-1))
    enddo
  enddo                                    (1)

(figure: triangular iteration space for 2 <= i <= j <= N with dependence vectors)

- perfect loop nest
- iteration space
- dependence vector
44 Technical Preliminaries (2)
Switch loop indexes:

  do m = 2, N
    do n = 2, m
      x(n,m) = 0.5*(x(n-1,m) + x(n,m-1))
    enddo
  enddo                                    (2)

an affine transformation
45 Polyhedral Model
- A polyhedron is a set {x | Ax <= c} for some matrix A and bounds vector c.
- Polyhedra (or polytopes) are objects in a many-dimensional space without holes.
- Iteration spaces of loops (with unit stride) can be represented as polyhedra.
- Array accesses and loop transformations can be represented as matrices.
46 Iteration Space
- A loop nest is represented as B I <= b for iteration vector I.
- Example:

  for (i=0; i<10; i++)
    for (j=i; j<10; j++)

  [ -1  0 ]            [ 0 ]
  [  1  0 ]  [ i ]  <= [ 9 ]
  [  1 -1 ]  [ j ]     [ 0 ]
  [  0  1 ]            [ 9 ]
47 Array Accesses
- Any array access A[e1][e2], for linear index expressions e1 and e2, can be represented as an access matrix and offset vector: A I + a.
- This can be considered a mapping from the iteration space into the storage space of the array (which is a trivial polyhedron).
48 Unimodular Matrices
- A unimodular matrix T is a matrix with integer entries and determinant +1 or -1.
- This means that such a matrix maps an object onto another object with exactly the same number of integer points in it.
- Its inverse T^-1 always exists and is unimodular as well.
49 Types of Unimodular Transformations
- Loop interchange
- Loop reversal
- Loop skewing for an arbitrary skew factor
- Since unimodular transformations are closed under multiplication, any combination is again a unimodular transformation.
50 Application
- The transformed loop nest is given by B T^-1 I' <= b.
- Any array access matrix A is transformed into A T^-1.
- The transformed loop nest needs to be normalized by means of Fourier-Motzkin elimination, to ensure that the loop bounds are affine expressions in the more outer loop indices.
51 Dependence Analysis
- Consider the following statements:
  S1: a = b + c
  S2: d = a + f
  S3: a = g + h
- S1 -> S2: true or flow dependence (RaW)
- S2 -> S3: anti-dependence (WaR)
- S1 -> S3: output dependence (WaW)
52 Dependences in Loops
- Consider the following loop:

  for (i=0; i<N; i++) {
    S1: a[i] = ...;
    S2: b[i] = a[i-1];
  }

- Loop-carried dependence S1 -> S2.
- We need to detect whether there exist i and i' such that i' = i - 1 in the loop space.
53 Definition of Dependence
- There is a dependence if there are two statement instances that refer to the same memory location and (at least) one of them is a write.
- There should not be a write between these two statement instances.
- In general, it is undecidable whether a dependence exists.
54 Direction of Dependence
- If there is a flow dependence between two statements S1 and S2 in a loop, then S1 writes to a variable in an earlier iteration than the one in which S2 reads that variable.
- The iteration vector of the write is lexicographically less than the iteration vector of the read.
- I < I' iff i1 = i'1 and ... and i(k-1) = i'(k-1) and ik < i'k, for some k.
55 Direction Vectors
- A direction vector is a vector
  (=,=,...,=,<,*,...,*)
  where * can denote =, <, or >.
- Such a vector encodes a (collection of) dependences.
- A loop transformation should result in a new direction vector for the dependence that is also lexicographically positive.
56 Loop Interchange
- Interchanging two loops also interchanges the corresponding entries in a direction vector.
- Example: if the direction vector of a dependence is (<,>), then we may not interchange the loops, because the resulting direction would be (>,<), which is lexicographically negative.
57 Affine Bounds and Indices
- We assume loop bounds and array index expressions are affine expressions:
  a0 + a1*i1 + ... + ak*ik
- Each loop bound for loop index ik is an affine expression over the previous loop indices i1 to i(k-1).
- Each array index expression is an affine expression over all loop indices.
58 Non-Affine Expressions
- Index expressions like i*j cannot be handled by dependence tests; we must assume that a dependence exists.
- An important class of index expressions are indirections A[B[i]]. These occur frequently in scientific applications (sparse-matrix computations).
- In embedded applications???
59 Linear Diophantine Equations
- A linear diophantine equation is of the form
  sum_j aj*xj = c
- The equation has a solution iff gcd(a1,...,an) is a divisor of c.
60 GCD Test for Dependence
- Assume a single loop and two references A[a+b*i] and A[c+d*i].
- If there exists a dependence, then gcd(b,d) divides (c-a).
- Note the direction of the implication!
- If gcd(b,d) does not divide (c-a), then there exists no dependence.
61 GCD Test (cont'd)
- However, if gcd(b,d) does divide (c-a), then there might exist a dependence.
- The test is not exact, since it does not take the loop bounds into account.
- For example:

  for (i=0; i<10; i++)
    A[i] = A[i+10] + 1;
62 GCD Test (cont'd)
- Using the theorem on linear diophantine equations, we can test in arbitrary loop nests.
- We need one test for each direction vector.
- The vector (=,=,...,=,<,...) implies that the first k indices are the same.
- See the book by Zima for details.
63 Other Dependence Tests
- There exist many dependence tests:
  - Separability test
  - GCD test
  - Banerjee test
  - Range test
  - Fourier-Motzkin test
  - Omega test
- Exactness increases, but so does the cost.
64 Fourier-Motzkin Elimination
- Consider a collection of linear inequalities over the variables i1,...,in:
  e1(i1,...,in) <= e1'(i1,...,in)
  ...
  em(i1,...,in) <= em'(i1,...,in)
- Is this system consistent, i.e. does there exist a solution?
- FM-elimination can determine this.
65 FM-Elimination (cont'd)
- First, rewrite all inequalities involving in as pairs L(i1,...,i(n-1)) <= in and in <= U(i1,...,i(n-1)). This is the solution for in.
- Then create a new system:
  L(i1,...,i(n-1)) <= U(i1,...,i(n-1))
  together with all original inequalities not involving in.
- This new system has one variable less, and we continue this way.
66 FM-Elimination (cont'd)
- After eliminating i1, we end up with a collection of inequalities between constants, c <= c'.
- The original system is consistent iff every such inequality can be satisfied, i.e. there is no inequality like
  10 <= 3.
- Exponentially many new inequalities may be generated!
67 Fourier-Motzkin Test
- The loop bounds plus the array index expressions generate sets of inequalities, using new loop indices i' for the sink of the dependence.
- Each direction vector generates inequalities:
  i1 = i'1, ..., i(k-1) = i'(k-1), ik < i'k
- If all these systems are inconsistent, then there exists no dependence.
- This test is not exact (there may be real solutions but no integer ones), but almost.
68 N-Dimensional Arrays
- Test in each dimension separately.
- This can introduce another level of inaccuracy.
- Some tests (the FM and Omega tests) can test many dimensions at the same time.
- Otherwise, you can linearize an array: transform a logically N-dimensional array to its one-dimensional storage format.
69 Hierarchy of Tests
- Try cheap tests first, then more expensive ones:

  if (cheap test1 == NO)
    then print NO
  else if (test2 == NO)
    then print NO
  else if (test3 == NO)
    then print NO
  else ...
70 Practical Dependence Testing
- Cheap tests, like the GCD and Banerjee tests, can disprove many dependences.
- Adding expensive tests only disproves a few more possible dependences.
- A compiler writer needs to trade off compilation time against the accuracy of dependence testing.
- For time-critical applications, expensive tests like the Omega test (exact!) can be used.