Title: Loop Transformations
1 Loop Transformations
- Motivation
- Loop-level transformation catalogue
  - Loop merging
  - Loop interchange
  - Loop unrolling
  - Unroll-and-jam
  - Loop tiling
- Loop transformation theory and dependence analysis
Thanks for many slides go to the DTSE people from IMEC and Dr. Peter Knijnenburg (2007) from Leiden University
2 Loop Transformations
- Change the order in which the iteration space is traversed.
- Can expose parallelism, increase available ILP, or improve memory behavior.
- Dependence testing is required to check the validity of a transformation.
3 Why loop transformations?
- Example 1: in-place mapping

  for (j=1; j<M; j++)
    for (i=0; i<N; i++)
      A[i][j] = f(A[i][j-1]);
  for (i=0; i<N; i++)
    OUT[i] = A[i][M-1];

  for (i=0; i<N; i++) {
    for (j=1; j<M; j++)
      A[i][j] = f(A[i][j-1]);
    OUT[i] = A[i][M-1];
  }

  After the transformation, each row A[i][*] is consumed into OUT[i] right after it is produced, so its storage can be reused in place.
4 Why loop transformations?
Example 2: memory allocation

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[i]);
  // N cyc. + N cyc. = 2N cycles; 2 background ports

  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = g(B[i]);
  }
  // N cycles; 1 background + 1 foreground port
5 Loop transformation catalogue
- merge (fusion)
  - improve locality
- bump
- extend/reduce
- body split
- reverse
  - improve regularity
- interchange
  - improve locality/regularity
- skew
- tiling
- index split
- unroll
- unroll-and-jam
6 Loop Merge

  for Ia = exp1 to exp2
    A(Ia)
  for Ib = exp1 to exp2
    B(Ib)

  =>

  for I = exp1 to exp2
    A(I)
    B(I)

- Improve locality
- Reduce loop overhead
7 Loop Merge: Example of locality improvement

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (j=0; j<N; j++)
    C[j] = f(B[j], A[j]);

  =>

  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[i] = f(B[i], A[i]);
  }

- Consumptions of the second loop are closer to the productions and consumptions of the first loop.
- Is this always the case after merging loops?
8 Loop Merge: not always allowed!
- Data dependences from the first to the second loop can block Loop Merge.
- Merging is allowed if, for every iteration I, the element consumed in loop 2 at iteration I was produced in loop 1 at iteration I or earlier.
- Enablers: Bump, Reverse, Skew

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[N-1]);     // N-1 > i: merging blocked

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N; i++)
    C[i] = g(B[i-2]);     // i-2 < i: merging allowed
9 Loop Bump

  for I = exp1 to exp2
    A(I)

  =>

  for I = exp1+N to exp2+N
    A(I-N)
10 Loop Bump: Example as enabler

  for (i=2; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N-2; i++)
    C[i] = g(B[i+2]);

  i+2 > i => merging not possible

  Loop Bump =>

  for (i=2; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N; i++)
    C[i-2] = g(B[i+2-2]);

  i+2-2 = i => merging possible

  Loop Merge =>

  for (i=2; i<N; i++) {
    B[i] = f(A[i]);
    C[i-2] = g(B[i]);
  }
11 Loop Extend

  for I = exp1 to exp2
    A(I)

  with exp3 <= exp1 and exp4 >= exp2

  =>

  for I = exp3 to exp4
    if I >= exp1 and I <= exp2
      A(I)
12 Loop Extend: Example as enabler

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=2; i<N+2; i++)
    C[i-2] = g(B[i]);

  Loop Extend =>

  for (i=0; i<N+2; i++)
    if (i<N) B[i] = f(A[i]);
  for (i=0; i<N+2; i++)
    if (i>=2) C[i-2] = g(B[i]);

  Loop Merge =>

  for (i=0; i<N+2; i++) {
    if (i<N) B[i] = f(A[i]);
    if (i>=2) C[i-2] = g(B[i]);
  }
13 Loop Reduce

  for I = exp1 to exp2
    if I >= exp3 and I <= exp4
      A(I)

  =>

  for I = max(exp1,exp3) to min(exp2,exp4)
    A(I)
14 Loop Body Split
- A(I) must be single-assignment: its elements should be written only once.

  for I = exp1 to exp2
    A(I)
    B(I)

  =>

  for Ia = exp1 to exp2
    A(Ia)
  for Ib = exp1 to exp2
    B(Ib)
15 Loop Body Split: Example as enabler

  for (i=0; i<N; i++) {
    A[i] = f(A[i-1]);
    B[i] = g(in[i]);
  }
  for (j=0; j<N; j++)
    C[j] = h(B[j], A[N]);

  Loop Body Split =>

  for (i=0; i<N; i++)
    A[i] = f(A[i-1]);
  for (k=0; k<N; k++)
    B[k] = g(in[k]);
  for (j=0; j<N; j++)
    C[j] = h(B[j], A[N]);

  Loop Merge =>

  for (i=0; i<N; i++)
    A[i] = f(A[i-1]);
  for (j=0; j<N; j++) {
    B[j] = g(in[j]);
    C[j] = h(B[j], A[N]);
  }
16 Loop Reverse

  for I = exp1 to exp2
    A(I)

  =>

  for I = exp2 downto exp1
    A(I)

  OR

  for I = exp1 to exp2
    A(exp2-(I-exp1))
17 Loop Reverse: Satisfy dependencies

  A[0] = ...;
  for (i=1; i<N; i++)
    A[i] = f(A[i-1]);

- No loop-carried dependencies allowed!
- Enabler: data-flow transformations (+ is associative)

  A[0] = ...;
  for (i=1; i<N; i++)
    A[i] = A[i-1] + f(...);

  =>

  A[N] = ...;
  for (i=N-1; i>0; i--)
    A[i] = A[i+1] + f(...);
18 Loop Reverse: Example as enabler

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[i] = g(B[N-i]);

  N-i > i => merging not possible

  Loop Reverse =>

  for (i=0; i<N; i++)
    B[i] = f(A[i]);
  for (i=0; i<N; i++)
    C[N-i] = g(B[N-(N-i)]);

  N-(N-i) = i => merging possible

  Loop Merge =>

  for (i=0; i<N; i++) {
    B[i] = f(A[i]);
    C[N-i] = g(B[i]);
  }
19 Loop Interchange

  for I1 = exp1 to exp2
    for I2 = exp3 to exp4
      A(I1, I2)

  =>

  for I2 = exp3 to exp4
    for I1 = exp1 to exp2
      A(I1, I2)
20 Loop Interchange: index traversal
(figure: traversal order of the (i,j) index space before and after interchange)

  for (i=0; i<W; i++)
    for (j=0; j<H; j++)
      A[i][j] = ...;

  Loop Interchange =>

  for (j=0; j<H; j++)
    for (i=0; i<W; i++)
      A[i][j] = ...;
21 Loop Interchange
- Validity check: dependence direction vectors.
- Mostly used to improve cache behavior.
- The innermost loop (whose index changes fastest) should (only) index the right-most array dimension in case of row-major storage, as in C.
- Can improve execution time by 1 or 2 orders of magnitude.
22 Loop Interchange (cont'd)
- Loop interchange can also expose parallelism.
- If an inner loop does not carry a dependence (its entry in the direction vector equals '='), this loop can be executed in parallel.
- Moving this loop outwards increases the granularity of the parallel loop iterations.
23 Loop Interchange: Example of locality improvement

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      B[i][j] = f(A[j], B[i][j-1]);

  =>

  for (j=0; j<M; j++)
    for (i=0; i<N; i++)
      B[i][j] = f(A[j], B[i][j-1]);

- In the second version, A[j] is reused N times (temporal locality).
- However, we lose spatial locality in B (assuming row-major ordering).
- => Exploration required.
24 Loop Interchange: Satisfy dependencies
- No loop-carried dependencies allowed, unless every production still precedes its consumption after the interchange.

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      A[i][j] = f(A[i-1][j+1]);   // interchange not allowed

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      A[i][j] = f(A[i-1][j]);     // interchange allowed

- Enablers:
  - Data-flow transformations
  - Loop Bump
25 Loop Skew: Basic transformation

  for I1 = exp1 to exp2
    for I2 = exp3 to exp4
      A(I1, I2)

  =>

  for I1 = exp1 to exp2
    for I2 = exp3+a*I1+b to exp4+a*I1+b
      A(I1, I2-a*I1-b)

  (with skew factor a and offset b)
26 Loop Skew

  for (j=0; j<H; j++)
    for (i=0; i<W; i++)
      A[i][j] = ...;

  Loop Skewing =>

  for (j=0; j<H; j++)
    for (i=0+j; i<W+j; i++)
      A[i-j][j] = ...;
27 Loop Skew: Example as enabler of regularity improvement

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      ... = f(A[i+j]);

  Loop Skew =>

  for (i=0; i<N; i++)
    for (j=i; j<i+M; j++)
      ... = f(A[j]);

  Loop Interchange =>

  for (j=0; j<N+M; j++)
    for (i=0; i<N; i++)
      if (j>=i && j<i+M)
        ... = f(A[j]);
28 Loop Tiling

  for I = 0 to exp1 * exp2
    A(I)

  Tile size: exp1
  Tile factor: exp2

  =>

  for I1 = 0 to exp2
    for I2 = exp1*I1 to exp1*(I1+1)
      A(I2)
29 Loop Tiling

  for (i=0; i<9; i++)
    A[i] = ...;

  Loop Tiling =>

  for (j=0; j<3; j++)
    for (i=4*j; i<4*j+4; i++)
      if (i<9)
        A[i] = ...;
30 Loop Tiling
- Improve cache reuse by dividing the iteration space into tiles and iterating over these tiles.
- Only useful when the working set does not fit into the cache or when there is much interference.
- Two adjacent loops can legally be tiled if they can legally be interchanged.
31 2-D Tiling Example

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];

  =>

  for (TI=0; TI<N; TI+=16)
    for (TJ=0; TJ<N; TJ+=16)
      for (i=TI; i<min(TI+16,N); i++)
        for (j=TJ; j<min(TJ+16,N); j++)
          A[i][j] = B[j][i];
32 2-D Tiling
(figure: index space traversal of the tiled loop nest)
33 What is the best tile size?
- Current tile-size selection algorithms use a cache model:
  - Generate a collection of tile sizes
  - Estimate the resulting cache miss rate
  - Select the best one
- They mostly only take the L1 cache into account.
- They mostly do not take n-way associativity into account.
34 Loop Index Split

  for Ia = exp1 to exp2
    A(Ia)

  =>

  for Ia = exp1 to p
    A(Ia)
  for Ib = p+1 to exp2
    A(Ib)
35 Loop Unrolling
- Duplicate the loop body and adjust the loop header.
- Increases available ILP, reduces loop overhead, and increases the possibilities for common subexpression elimination.
- Always valid!
36 (Partial) Loop Unrolling

  for I = exp1 to exp2
    A(I)

  full unroll =>

  A(exp1); A(exp1+1); A(exp1+2); ... A(exp2)

  unroll factor 2 =>

  for I = exp1/2 to exp2/2
    A(2I)
    A(2I+1)
37 Loop Unrolling: Downside
- If the unroll factor is not a divisor of the trip count, a remainder loop must be added.
- If the trip count is not known at compile time, a runtime check is needed.
- Code size increases, which may result in a higher I-cache miss rate.
- Global determination of optimal unroll factors is difficult.
38 Loop Unroll: Example of removal of non-affine iterator

  for (L=1; L<4; L++)
    for (i=0; i<(1<<L); i++)
      A(L,i);

  =>

  for (i=0; i<2; i++) A(1,i);
  for (i=0; i<4; i++) A(2,i);
  for (i=0; i<8; i++) A(3,i);
39 Unroll-and-Jam
- Unroll the outer loop and fuse the resulting copies of the inner loop.
- Increases the size of the loop body and hence the available ILP.
- Can also improve locality.
40 Unroll-and-Jam Example

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];

  Unroll =>

  for (i=0; i<N; i+=2) {
    for (j=0; j<N; j++)
      A[i][j] = B[j][i];
    for (j=0; j<N; j++)
      A[i+1][j] = B[j][i+1];
  }

  Jam =>

  for (i=0; i<N; i+=2)
    for (j=0; j<N; j++) {
      A[i][j] = B[j][i];
      A[i+1][j] = B[j][i+1];
    }

- More ILP exposed
- Spatial locality of B enhanced
41 Simplified loop transformation script
- Give all loops the same nesting depth
  - Use dummy 1-iteration loops if necessary
- Improve regularity
  - Usually by applying loop interchange or reverse
- Improve locality
  - Usually by applying loop merge
  - Break data dependencies with loop bump/skew
  - Sometimes loop index split or unrolling is easier
42 Loop transformation theory
- Iteration spaces
- Polyhedral model
- Dependence analysis
43 Technical Preliminaries (1)

  do i = 2, N
    do j = i, N
      x(i,j) = 0.5*(x(i-1,j) + x(i,j-1))
    enddo
  enddo                                    (1)

(figure: triangular iteration space for 2 <= i <= j <= N with dependence vectors)

- perfect loop nest
- iteration space
- dependence vector
44 Technical Preliminaries (2)
Switch loop indexes:

  do m = 2, N
    do n = 2, m
      x(n,m) = 0.5*(x(n-1,m) + x(n,m-1))
    enddo
  enddo                                    (2)

an affine transformation
45 Polyhedral Model
- A polyhedron is a set {x | Ax <= c} for some matrix A and bounds vector c.
- Polyhedra (or polytopes) are objects in a many-dimensional space without holes.
- Iteration spaces of loops (with unit stride) can be represented as polyhedra.
- Array accesses and loop transformations can be represented as matrices.
46 Iteration Space
- A loop nest is represented as B I <= b for iteration vector I.
- Example:

  for (i=0; i<10; i++)
    for (j=i; j<10; j++)

  [ -1  0 ]            [ 0 ]
  [  1  0 ]  [ i ]  <= [ 9 ]
  [  1 -1 ]  [ j ]     [ 0 ]
  [  0  1 ]            [ 9 ]
47 Array Accesses
- Any array access A[e1][e2], for linear index expressions e1 and e2, can be represented as an access matrix and offset vector: A I + a.
- This can be considered a mapping from the iteration space into the storage space of the array (which is a trivial polyhedron).
48 Unimodular Matrices
- A unimodular matrix T is a matrix with integer entries and determinant +1 or -1.
- This means that such a matrix maps an object onto another object with exactly the same number of integer points in it.
- Its inverse T^-1 always exists and is unimodular as well.
49 Types of Unimodular Transformations
- Loop interchange
- Loop reversal
- Loop skewing for an arbitrary skew factor
- Since unimodular transformations are closed under multiplication, any combination is again a unimodular transformation.
50 Application
- The transformed loop nest is given by B T^-1 I' <= b.
- Any array access matrix A is transformed into A T^-1.
- The transformed loop nest needs to be normalized by means of Fourier-Motzkin elimination, to ensure that the loop bounds are affine expressions in the more outer loop indices.
51 Dependence Analysis
- Consider the following statements:
  S1: a = b + c
  S2: d = a + f
  S3: a = g + h
- S1 -> S2: true or flow dependence (RaW)
- S2 -> S3: anti-dependence (WaR)
- S1 -> S3: output dependence (WaW)
52 Dependences in Loops
- Consider the following loop:

  for (i=0; i<N; i++) {
    S1: a[i] = ...;
    S2: b[i] = a[i-1];
  }

- Loop-carried dependence S1 -> S2.
- We need to detect whether there exist i and i' such that i' = i - 1 in the loop space.
53 Definition of Dependence
- There is a dependence if there are two statement instances that refer to the same memory location and (at least) one of them is a write.
- There should not be a write between these two statement instances.
- In general, it is undecidable whether a dependence exists.
54 Direction of Dependence
- If there is a flow dependence between two statements S1 and S2 in a loop, then S1 writes to a variable in an earlier iteration than the one in which S2 reads that variable.
- The iteration vector of the write is lexicographically less than the iteration vector of the read.
- I < I' iff i1 = i'1 and ... and i(k-1) = i'(k-1) and ik < i'k, for some k.
55 Direction Vectors
- A direction vector is a vector
  (=,=,...,=,<,*,...,*)
  where * can denote =, <, or >.
- Such a vector encodes a (collection of) dependences.
- A loop transformation should result in a new direction vector for the dependence that is also lexicographically positive.
56 Loop Interchange
- Interchanging two loops also interchanges the corresponding entries in a direction vector.
- Example: if the direction vector of a dependence is (<,>), then we may not interchange the loops, because the resulting direction would be (>,<), which is lexicographically negative.
57 Affine Bounds and Indices
- We assume loop bounds and array index expressions are affine expressions:
  a0 + a1*i1 + ... + ak*ik
- Each loop bound for loop index ik is an affine expression over the previous loop indices i1 to i(k-1).
- Each array index expression is an affine expression over all loop indices.
58 Non-Affine Expressions
- Index expressions like i*j cannot be handled by dependence tests; we must assume that a dependence exists.
- An important class of index expressions are indirections A[B[i]]. These occur frequently in scientific applications (sparse-matrix computations).
- In embedded applications???
59 Linear Diophantine Equations
- A linear diophantine equation is of the form
  sum_j aj*xj = c
- The equation has a solution iff gcd(a1,...,an) is a divisor of c.
60 GCD Test for Dependence
- Assume a single loop and two references A[a+b*i] and A[c+d*i].
- If there exists a dependence, then gcd(b,d) divides (c-a).
- Note the direction of the implication!
- If gcd(b,d) does not divide (c-a), then there exists no dependence.
61 GCD Test (cont'd)
- However, if gcd(b,d) does divide (c-a), then there might exist a dependence.
- The test is not exact, since it does not take the loop bounds into account.
- For example:

  for (i=0; i<10; i++)
    A[i] = A[i+10] + 1;
62 GCD Test (cont'd)
- Using the theorem on linear diophantine equations, we can test in arbitrary loop nests.
- We need one test for each direction vector.
- The vector (=,=,...,=,<,...) implies that the first k indices are the same.
- See the book by Zima for details.
63 Other Dependence Tests
- There exist many dependence tests:
  - Separability test
  - GCD test
  - Banerjee test
  - Range test
  - Fourier-Motzkin test
  - Omega test
- Exactness increases, but so does the cost.
64 Fourier-Motzkin Elimination
- Consider a collection of linear inequalities over the variables i1,...,in:
  e1(i1,...,in) <= e1'(i1,...,in)
  ...
  em(i1,...,in) <= em'(i1,...,in)
- Is this system consistent, i.e. does there exist a solution?
- FM-elimination can determine this.
65 FM-Elimination (cont'd)
- First, rewrite all inequalities involving in as pairs L(i1,...,i(n-1)) <= in and in <= U(i1,...,i(n-1)). This is the solution for in.
- Then create a new system:
  L(i1,...,i(n-1)) <= U(i1,...,i(n-1))
  together with all original inequalities not involving in.
- This new system has one variable less, and we continue this way.
66 FM-Elimination (cont'd)
- After eliminating i1, we end up with a collection of inequalities between constants, c <= c'.
- The original system is consistent iff every such inequality can be satisfied, i.e. there is no inequality like
  10 <= 3.
- Exponentially many new inequalities may be generated!
67 Fourier-Motzkin Test
- The loop bounds plus the array index expressions generate sets of inequalities, using new loop indices i' for the sink of the dependence.
- Each direction vector generates inequalities:
  i1 = i'1, ..., i(k-1) = i'(k-1), ik < i'k
- If all these systems are inconsistent, then there exists no dependence.
- This test is not exact (there may be real solutions but no integer ones), but almost.
68 N-Dimensional Arrays
- Test in each dimension separately.
- This can introduce another level of inaccuracy.
- Some tests (the FM and Omega tests) can test many dimensions at the same time.
- Otherwise, you can linearize an array: transform a logically N-dimensional array to its one-dimensional storage format.
69 Hierarchy of Tests
- Try cheap tests first, then more expensive ones:

  if (cheap test1 == NO)
    then print NO
  else if (test2 == NO)
    then print NO
  else if (test3 == NO)
    then print NO
  else ...
70 Practical Dependence Testing
- Cheap tests, like the GCD and Banerjee tests, can disprove many dependences.
- Adding expensive tests only disproves a few more possible dependences.
- A compiler writer needs to trade off compilation time against the accuracy of dependence testing.
- For time-critical applications, expensive tests like the Omega test (exact!) can be used.