Title: Loop Transformations
1 Loop Transformations
- Vectorization
- Statement Reordering
- Loop Distribution
2 Strategies for Optimization
- Loop transformations can be applied systematically, in a variety of ways, to bring a program into a desired form.
- The particular strategy described here was developed to generate code for vector supercomputers.
- It can also be used for vector co-processors and, with few modifications, for SIMD co-processors.
3 The xvector option
- Transforms loops that use an intrinsic function into a single call to a faster vectorized equivalent:
    for (i = 0; i < N; i++)
        x[i] = exp(y[i]);
- Syntax: -xvector=yes|no
- These vector functions are part of the libmvec library (see man libmvec)
- It must also be added on the link line
4 Vectorization
Targets: vector supercomputers, vector co-processors and, with few modifications, SIMD co-processors
- Goals
  - generate array syntax
  - convert Fortran 77 to Fortran 90
  - create vector code for vector supercomputers or vector co-processors
- Approach
  - convert as many statements in program loops as possible into vector form
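5 Vector Code
    A(2:101) = B(1:100) + A(1:100)
- The right-hand side is evaluated in full, then the left-hand side is assigned.
- The above is not the same as
    DO I = 2, 101
      A(I) = B(I-1) + A(I-1)
    END DO
- The loop has a true dependence: each iteration reads the A value written by the previous one.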
6 Vector Code
    A(2:101) = B(1:100) + A(1:100)
- is the same as
    DO I = 2, 101
      TEMP(I) = B(I-1) + A(I-1)
    END DO
    DO I = 2, 101
      A(I) = TEMP(I)
    END DO
- Neither loop has any dependences.
- The first loop must finish before the second one starts.
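The difference between the vector statement and the sequential loop can be checked with a small sketch. Python lists stand in for the Fortran arrays, the array length is shortened to 4, and the function names are made up for this illustration.

```python
def loop_version(a, b):
    """Sequential loop: A(I) = B(I-1) + A(I-1), one element at a time."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = b[i - 1] + a[i - 1]   # reads the value written one iteration earlier
    return a


def vector_version(a, b):
    """Vector statement: the whole right-hand side is evaluated before any store."""
    a = list(a)
    temp = [b[i - 1] + a[i - 1] for i in range(1, len(a))]   # TEMP loop
    a[1:] = temp                                             # copy-back loop
    return a


a = [1, 1, 1, 1]
b = [10, 20, 30, 40]
loop_result = loop_version(a, b)      # [1, 11, 31, 61]
vector_result = vector_version(a, b)  # [1, 11, 21, 31]
```

The two results differ because the loop carries a true dependence, while the vector form only ever reads old values of A.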
7 Vectorization
- The transformation generates a vector statement to replace a statement within a loop nest.
- It is applied to individual statements.
- So we must first generate a series of loop nests in which each loop body consists of a single statement.
- Vectorization is legal if it preserves all dependences of the original code.
8 Example
    DO I = 1, N
      A(I) = B(I) + C(I)
      D(I) = A(I-1) + D(I)
    END DO
After loop distribution:
    DO I = 1, N
      A(I) = B(I) + C(I)
    END DO
    DO I = 1, N
      D(I) = A(I-1) + D(I)
    END DO
The first loop becomes
    A(1:N) = B(1:N) + C(1:N)
and the second becomes
    D(1:N) = A(0:N-1) + D(1:N)
9 Strategy
- First, loop distribution is used to replace a loop nest by multiple loops with a single statement each,
- and then vector code generation is used to vectorize each statement in both loops.
- This works so long as there is no dependence cycle involving these statements.
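A minimal sketch of this two-step strategy on the example from slide 8. Index 0 of each Python list plays the role of element 0 of the Fortran arrays, the operators follow the reconstruction used on that slide, and the function names are hypothetical.

```python
def fused_loop(a, b, c, d):
    """Original loop with two statements in the body (I = 1..N)."""
    a, d = list(a), list(d)
    for i in range(1, len(a)):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] + d[i]
    return a, d


def distributed_and_vectorized(a, b, c, d):
    """One vector statement per original statement, in textual order."""
    a, d = list(a), list(d)
    n = len(a) - 1
    a[1:] = [b[i] + c[i] for i in range(1, n + 1)]       # A(1:N) = B(1:N) + C(1:N)
    d[1:] = [a[i - 1] + d[i] for i in range(1, n + 1)]   # D(1:N) = A(0:N-1) + D(1:N)
    return a, d


a = [5, 0, 0, 0]
b = [0, 1, 2, 3]
c = [0, 10, 20, 30]
d = [0, 100, 200, 300]
fused = fused_loop(a, b, c, d)
vectorized = distributed_and_vectorized(a, b, c, d)
```

Both versions agree here because the only dependence runs forward, from the statement defining A to the statement using it.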
10 In General
- If there is a dependence whose source is a statement that comes textually after the sink statement, we reorder the statements in the loop first.
- The vector statement defining new values must be executed before the vector statement using them.
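11 Statement Reordering Example
The original loop has an antidependence from the second statement to the first:
    DO I = 1, N
      A(I) = B(I) + C(I)
      D(I) = A(I+1) + D(I)
    END DO
Statement reordering gives
    DO I = 1, N
      D(I) = A(I+1) + D(I)
      A(I) = B(I) + C(I)
    END DO
Loop distribution gives
    DO I = 1, N
      D(I) = A(I+1) + D(I)
    END DO
    DO I = 1, N
      A(I) = B(I) + C(I)
    END DO
The first loop becomes
    D(1:N) = A(2:N+1) + D(1:N)
and the second becomes
    A(1:N) = B(1:N) + C(1:N)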
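The need for reordering can be seen in a short sketch of the antidependence case, where the second statement reads A(I+1). Lists are 0-based, `a` has one more element than `d`, and the function names are invented for the illustration.

```python
def original_loop(a, b, c, d):
    a, d = list(a), list(d)
    for i in range(len(d)):
        a[i] = b[i] + c[i]
        d[i] = a[i + 1] + d[i]   # reads A(I+1) before the next iteration overwrites it
    return a, d


def vectorized_without_reordering(a, b, c, d):
    a, d = list(a), list(d)
    n = len(d)
    a[:n] = [b[i] + c[i] for i in range(n)]
    d[:] = [a[i + 1] + d[i] for i in range(n)]   # wrong: sees the new A values
    return a, d


def vectorized_after_reordering(a, b, c, d):
    a, d = list(a), list(d)
    n = len(d)
    d[:] = [a[i + 1] + d[i] for i in range(n)]   # uses old A values first
    a[:n] = [b[i] + c[i] for i in range(n)]
    return a, d


args = ([1, 2, 3, 4], [10, 20, 30], [1, 1, 1], [0, 0, 0])
expected = original_loop(*args)
wrong = vectorized_without_reordering(*args)
right = vectorized_after_reordering(*args)
```

Only the reordered version reproduces the result of the original loop.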
12 Statement Reordering
- The statement reordering transformation has many uses.
- Synopsis: exchange the position of two adjacent statements in the body of a loop.
- Legality test: valid iff there is no loop-independent dependence between the pair of statements.
13 Statement Reordering
bef: appears in source text before
- Exchanges the text position of two adjacent statements in a loop.
- Given: loop L containing adjacent statements S, S', with S bef S', such that S δ_∞ S' does not hold. Creates loop L' with S and S' swapped.
- Suppose S1 δ S2 for statements S1 and S2. Then there are iteration vectors i1 and i2 such that S1(i1) δ_k S2(i2), and the former defines a value used by the latter.
- If k = ∞, swapping would change the meaning of the loop. So this is what we must test for.
14 Loop Distribution
- Used to isolate some statements in a loop nest from other statements.
- The simple form of loop distribution attempts to transform a loop nest with multiple assignment statements in the loop body into a series of loop nests, each containing a single assignment statement.
- It is not necessary that the statements are all enclosed in exactly the same loop nest.
15 Loop Distribution Example
    DO K = 1, M
      A(K) = 0.
      DO J = 1, N
        A(K) = A(K) + B(J,K)
      END DO
    END DO
becomes
    DO K = 1, M
      A(K) = 0.
    END DO
    DO K = 1, M
      DO J = 1, N
        A(K) = A(K) + B(J,K)
      END DO
    END DO
16 Validity of Loop Distribution
- Let S, S' be statements in a loop L.
- A dependence S δ S' is backward iff S' bef S.
- We call all other dependences forward.
- A backward dependence is loop-carried. Distribution of L would destroy it.
- Therefore loop distribution may be applied to a loop L iff all of its dependences are forward.
17 Illegal Loop Distribution Example
Backward dependence: S2 δ S1
    DO I = 1, N
      S1: A(I) = B(I) + C(I-1)
      S2: C(I) = B(I)
    END DO
is not equivalent to
    DO I = 1, N
      S1: A(I) = B(I) + C(I-1)
    END DO
    DO I = 1, N
      S2: C(I) = B(I)
    END DO
Of course, distribution would have been legal if we had applied statement reordering first.
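A small sketch of why this distribution is illegal, and why reordering first repairs it. Index 0 of each list holds the boundary element C(0), and the function names are made up for the illustration.

```python
def original(a, b, c):
    a, c = list(a), list(c)
    for i in range(1, len(a)):
        a[i] = b[i] + c[i - 1]   # S1 uses C(I-1), written by S2 one iteration earlier
        c[i] = b[i]              # S2
    return a, c


def s1_loop_then_s2_loop(a, b, c):
    """Distribution in textual order: breaks the backward dependence S2 -> S1."""
    a, c = list(a), list(c)
    for i in range(1, len(a)):
        a[i] = b[i] + c[i - 1]
    for i in range(1, len(a)):
        c[i] = b[i]
    return a, c


def s2_loop_then_s1_loop(a, b, c):
    """Distribution after statement reordering: all dependences are forward."""
    a, c = list(a), list(c)
    for i in range(1, len(a)):
        c[i] = b[i]
    for i in range(1, len(a)):
        a[i] = b[i] + c[i - 1]
    return a, c


args = ([0, 0, 0, 0], [0, 1, 2, 3], [7, 0, 0, 0])
expected = original(*args)
illegal = s1_loop_then_s2_loop(*args)
legal = s2_loop_then_s1_loop(*args)
```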
18 Loop Distribution
- Let L be a loop and L' be the code obtained by a legal distribution of loop L.
- Then L' preserves the type of all dependences in L. But it does not preserve the level.
- If S δ_∞ S' in L, then S δ_∞ S' in L'.
- If S δ_c S in L, then S δ_c S in L'.
- If S δ_c S' in L, where S ≠ S', then S δ_∞ S' in L'.
19 Loop Distribution
- A loop L can be transformed into an equivalent loop L' with no backward dependences by a sequence of valid statement reordering transformations iff its dependence graph
  - is either acyclic or
  - contains only single-statement cycles.
- If vectorization is the objective, it can immediately follow.
20 Limits of Loop Distribution
- If there is a non-trivial cycle in the dependence graph, e.g. S δ S' and S' δ S, then any legal statement reordering will exchange the dependences.
- But this does not eliminate backward edges, so loop distribution cannot be applied to the resulting loop.
21 Loop Distribution
- Assume loop L has an acyclic dependence graph.
- 1. Number the nodes of the graph S1, .., Sn such that if Sj δ Sk, then j < k.
  - Then if S bef S1, S δ S1 does not hold.
- 2. So S1 can be moved to the first position in the loop by legal statement reorderings.
- 3. Now assume the first k statements in the loop are S1,..,Sk. Let S be the statement immediately before Sk+1.
  - Since S δ Sk+1 cannot hold, Sk+1 can be reordered to follow Sk.
- The resulting loop has only forward edges.
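The numbering in step 1 is a topological ordering of the dependence graph. A minimal sketch, assuming dependences are given as (source, sink) pairs of statement names; the function name `forward_order` is hypothetical, and self-dependences are ignored because they never need reordering.

```python
def forward_order(statements, deps):
    """Return an order of `statements` in which every dependence points forward.

    `statements` is the textual order; `deps` is a list of (source, sink) pairs.
    """
    remaining = list(statements)
    order = []
    while remaining:
        # A statement is ready when no other remaining statement is the
        # source of a dependence into it.
        ready = [s for s in remaining
                 if not any(src in remaining and src != s and dst == s
                            for src, dst in deps)]
        if not ready:
            raise ValueError("dependence graph has a non-trivial cycle")
        order.append(ready[0])        # prefer the textually earliest statement
        remaining.remove(ready[0])
    return order


reordered = forward_order(["S1", "S2"], [("S2", "S1")])

try:
    forward_order(["S1", "S2"], [("S1", "S2"), ("S2", "S1")])
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

For the loop with a backward dependence S2 δ S1, the sketch puts S2 first; a two-statement cycle is rejected, matching the limits described on slide 20.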
22 Loop Fission
- Name sometimes given to a more general form of loop distribution
- Some loops are only partially distributive.
- The general transformation separates distributive statements from non-distributive code.
- The former can then be vectorized.
23 Loop Fission Example
    DO I = 1, N
      DO J = 1, M
        S1: A(I,J) = ..
        S2: A(I,J+1) = A(I,J)
      END DO
      S3: B(I) = X + Y(I)
      S4: D(I) = B(I) + 1
    END DO
Dependence graph: S1 δt S2 and S2 δo S1 (a cycle); S3 δt S4.
24 Loop Fission Example
Not vectorizable (S1 and S2 form a dependence cycle):
    DO I = 1, N
      DO J = 1, M
        S1: A(I,J) = ..
        S2: A(I,J+1) = A(I,J)
      END DO
    END DO
Vectorizable:
    DO I = 1, N
      S3: B(I) = X + Y(I)
    END DO
    DO I = 1, N
      S4: D(I) = B(I) + 1
    END DO
25 Loop Fission Example
    DO I = 1, N
      DO J = 1, M
        S1: A(I,J) = ..
        S2: A(I,J+1) = A(I,J)
      END DO
    END DO
    S3: B(1:N) = X + Y(1:N)
    S4: D(1:N) = B(1:N) + 1
Remaining dependences in the loop nest: S1 δt S2 and S2 δo S1.
26 Vector Code Generation
- Replaces a single-statement loop that is not a recurrence by its vector (array) form
    DO I = L, U
      A(I) = B(I) + C(I+1)
    END DO
- becomes
    A(L:U) = B(L:U) + C(L+1:U+1)
27 General Vector Code Generation
- 1. Find the cycles in the dependence graph D.
- 2. Create the acyclic condensation AC(D) (a new graph in which each cycle is represented by a single node). Order its nodes.
- 3. Perform a sequence of statement reordering transformations so that all statements in a cycle are adjacent, and the order of statements in different cycles corresponds to their order in AC(D).
- 4. Apply loop distribution to this loop.
28 Example
    DO I = L, U
      S1: A(I) = B(I) + C(I+1)
      S2: B(I+1) = A(I-1) + D(I)
      S3: E(I) = E(I-1) + 1
      S4: C(I+2) = E(I) + F(I)
    END DO
- S1 and S2 are in a dependence cycle
- S3 has a single-statement cycle
- S3 δ S4 and S4 δ S1
- All five edges of the dependence graph are true dependences (δt).
29 Example
- The acyclic condensation is S3 → S4 → (S1,S2), which gives the loops
    DO I = L, U
      S3: E(I) = E(I-1) + 1
    END DO
    DO I = L, U
      S4: C(I+2) = E(I) + F(I)
    END DO
    DO I = L, U
      S1: A(I) = B(I) + C(I+1)
      S2: B(I+1) = A(I-1) + D(I)
    END DO
30 Example
- Only the second loop is vectorizable
    DO I = L, U
      S3: E(I) = E(I-1) + 1
    END DO
    S4: C(L+2:U+2) = E(L:U) + F(L:U)
    DO I = L, U
      S1: A(I) = B(I) + C(I+1)
      S2: B(I+1) = A(I-1) + D(I)
    END DO
- The first loop is a recurrence; the third contains a dependence cycle.
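The example can be reproduced with a small sketch of steps 1 and 2: group statements that reach each other into cycles, then order the groups. This is a simple reachability-based version suited to a handful of statements, not a production algorithm; the function names and the (source, sink) encoding of the dependence graph are assumptions.

```python
def reachable(start, deps):
    """Statements reachable from `start` along one or more dependence edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in deps:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def condensation_order(statements, deps):
    """Group statements into dependence cycles and order the groups."""
    reach = {s: reachable(s, deps) for s in statements}
    groups = []
    for s in statements:
        for g in groups:
            if s in reach[g[0]] and g[0] in reach[s]:   # mutually reachable
                g.append(s)
                break
        else:
            groups.append([s])
    ordered, remaining = [], list(groups)
    while remaining:
        for g in remaining:
            # g is ready when no other remaining group reaches into it
            if not any(g[0] in reach[h[0]] for h in remaining if h is not g):
                ordered.append(g)
                remaining.remove(g)
                break
    return ordered


deps = [("S1", "S2"), ("S2", "S1"), ("S3", "S3"), ("S3", "S4"), ("S4", "S1")]
order = condensation_order(["S1", "S2", "S3", "S4"], deps)
# A group is vectorizable if it is a single statement with no self-dependence.
vectorizable = [g[0] for g in order if len(g) == 1 and (g[0], g[0]) not in deps]
```

The resulting order is S3, S4, (S1,S2), and only S4 qualifies for vector code generation.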
31 Vector Code Algorithm
- This does not suffice to deal well with many loop nests encountered in practice.
- Fortunately, we can still improve upon this result in general.
- We
  - can sometimes further transform a loop nest so that the vector code transformation is applicable
  - can exploit opportunities for vectorization of some levels of a loop nest
  - may also apply additional dependence-breaking transformations.
32 Scalar Expansion
    DO K = 1, N
      DO I = 1, N
        T = T + A(K)
becomes
    DO K = 1, N
      DO I = 1, N
        T(I) = T(I) + A(K)
- This transformation creates a copy of a scalar variable for each iteration of the loop nest in which it occurs, by replacing the variable with an appropriately dimensioned array.
- This transformation may eliminate dependences while preserving the semantics of the code.
- It is particularly useful in vectorization and shared-memory parallelization.
33 Scalar Expansion
- Replaces a scalar variable by an array of the same size as a loop's iteration space.
- Input: perfect nest L of n loops with scalar A on the left-hand side of an assignment in L.
  - A is not a formal parameter, induction variable, single-statement reduction or element in a common block.
  - The kth loop in L has lower bound Lk and upper bound Uk.
- Output: modified loop where all references to A are replaced by a reference to an n-dimensional array A'(L1:U1,..,Ln:Un), followed by an assignment to A immediately after the loop nest:
    A = A'(U1, U2, .., Un)
34 Scalar Expansion
- If the first occurrence of A in L is a use, we modify the above as follows:
  - The lower bound in the last dimension is (Ln - 1) instead.
  - An assignment to A' is inserted immediately before the loop: A'(L1,..,Ln - 1) = A.
  - All uses of A before the first definition in L are replaced by A'(I1,I2,..,In - 1). All other occurrences are replaced as previously.
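A minimal sketch of scalar expansion for the simple case where the scalar is defined before it is used in each iteration of a single loop. The loop body and the names `t_expanded`, `scalar_version` and `expanded_version` are invented for this illustration; the sketch assumes at least one iteration.

```python
def scalar_version(a, b):
    """Loop with a temporary scalar T that is redefined in every iteration."""
    t = 0
    c = [0] * len(a)
    for i in range(len(a)):
        t = a[i] + b[i]
        c[i] = t * t
    return c, t


def expanded_version(a, b):
    """T replaced by an array with one element per iteration."""
    n = len(a)
    t_expanded = [a[i] + b[i] for i in range(n)]               # T'(1:N) = A(1:N) + B(1:N)
    c = [t_expanded[i] * t_expanded[i] for i in range(n)]      # C(1:N) = T'(1:N) * T'(1:N)
    t = t_expanded[n - 1]                                      # T = T'(U1) after the loop
    return c, t


scalar_result = scalar_version([1, 2, 3], [4, 5, 6])
expanded_result = expanded_version([1, 2, 3], [4, 5, 6])
```

With the scalar gone, each statement can be executed as a whole-array operation, and the final assignment restores the value the scalar would have had after the loop.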
35 Idiom Recognition
Maybe we can handle this after all:
    DO I = L, U
      E(I) = E(I-1) + 1
    END DO
- Recognize frequent (small) computations that can be tuned to the target architecture, and perform specific optimizations.
- Examples
  - dot product
  - max and min values of an array
  - maxloc and minloc of an array
  - scalar product
  - matrix multiplication
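As one possible illustration of what recognizing the recurrence above could buy, the loop has a closed form, E(I) = E(L-1) + (I - L + 1), in which no element depends on another. This is a sketch of that observation only, not a claim about how any particular compiler handles the idiom; the function names are hypothetical.

```python
def recurrence(e_before, n):
    """E(I) = E(I-1) + 1 for n iterations, starting from E(L-1) = e_before."""
    e = [e_before]
    for _ in range(n):
        e.append(e[-1] + 1)      # each element needs the previous one
    return e[1:]


def closed_form(e_before, n):
    """Same values without a loop-carried dependence."""
    return [e_before + k for k in range(1, n + 1)]


sequential = recurrence(10, 4)
independent = closed_form(10, 4)
```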
36 Loop Interchange
- Powerful transformation to exchange the order of two loops
- Reorders execution of statement instances, so it is only legal if this does not change the semantics of the program
- Has many uses, e.g.
  - Move vectorizable loops to the innermost position of a loop nest
  - Move parallelizable loops to the outermost position to increase granularity and decrease synchronization
37 Loop Interchange
- Rearranges the execution order of statement instances in a loop nest
- Goals: increase use of cache, move dependence cycles
- Legality: valid if data dependences are preserved
- Input: a (perfectly) nested loop L of depth n > 1, interchange level c < n
- Output: loop nest L' that is obtained from L by exchanging the order of the DO loop headers for Lc and Lc+1
- Generalizations: to imperfectly nested loops, and to pairs of non-adjacent loops
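38 Loop Interchange Example
    DO J = 1, M
      DO I = 1, N
        A(I,J+1) = A(I+1,J) + B
      END DO
    END DO
Can the loops be interchanged, so that DO I = 1, N becomes the outer loop and DO J = 1, M the inner loop?
In the original order, A(2,2) is defined in iteration (1,2) and used in (2,1), with iterations written as (J,I).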
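39 Loop Interchange Example
- Loop interchange is not legal for this loop
    DO J = 1, M
      DO I = 1, N
        A(I,J+1) = A(I+1,J) + B
      END DO
    END DO
- With DO I = 1, N outermost and DO J = 1, M innermost, A(2,2) is defined in iteration (2,1) and used in (1,2), with iterations written as (I,J).
- The direction vector would be (>,<): the use would execute before the definition, turning the true dependence into an antidependence (δa).
- The use of A(2,2) will see the wrong value.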
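The effect can be observed directly by running both loop orders on a small array. In this sketch N = M = 2, B is taken to be 1, and the array is filled with distinct initial values so that stale reads are visible; all of these choices, and the function name, are assumptions made for the illustration.

```python
def run(order, n=2, m=2, b=1):
    """Execute A(I,J+1) = A(I+1,J) + B with the given loop order."""
    a = [[100 * i + j for j in range(m + 2)] for i in range(n + 2)]
    if order == "J outer":
        iterations = [(i, j) for j in range(1, m + 1) for i in range(1, n + 1)]
    else:  # "I outer": the interchanged nest
        iterations = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)]
    for i, j in iterations:
        a[i][j + 1] = a[i + 1][j] + b
    return a


original = run("J outer")
interchanged = run("I outer")
```

In the original order A(1,3) is computed from the new A(2,2); after interchange it is computed from the old one, so the results differ.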
40 Loop Interchange
- The direction vector records the loop levels associated with each dependence.
- After interchanging a pair of adjacent loops, the new direction vector is formed by exchanging the corresponding entries in the direction vector.
- If loop interchange swaps the references involved in a dependence, then it will change the semantics of the program.
- A dependence with direction vector (<,>) would become (>,<) if the loops are swapped. But this does reverse the access pattern!
41 Loop Interchange
- So we must test whether there are direction vectors of the form (<,>) involving the loop levels that we want to exchange.
- If not, we can apply the transformation.
- If there are such direction vectors, the transformation is illegal and cannot be applied.
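The test can be sketched as a check over the direction vectors of a loop nest. Here direction vectors are tuples of the strings "<", "=", ">", and `c` is the 0-based index of the outer loop of the pair; this encoding and the function name are assumptions. The sketch applies the test exactly as stated above, which is conservative.

```python
def interchange_is_legal(direction_vectors, c):
    """True if no dependence has '<' at level c and '>' at level c + 1."""
    for dv in direction_vectors:
        if dv[c] == "<" and dv[c + 1] == ">":
            return False          # swapping the entries would give (>, <)
    return True


# The example A(I,J+1) = A(I+1,J) + B has direction vector (<, >).
blocked = interchange_is_legal([("<", ">")], 0)
# A dependence carried only by the innermost of three loops, as in matrix multiplication.
allowed = interchange_is_legal([("=", "=", "<")], 1)
```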
42 Legal Loop Interchange
- The transformation exchanging loops c, c+1 does not modify any loop dependences at levels other than c, c+1.
- If dir(i,i') is (=^(c-1), <, =, *, .., *), then in the modified loop this will become (=^c, <, *, .., *). If dir(i,i') is (=^(c-1), <, <, *, .., *), it remains unchanged.
- Thus if S(i) δ_k S'(i') for any 1 ≤ k < c or c+1 < k ≤ n, or if k = ∞, then S(i) δ_k S'(i') in the modified loop.
- If S(i) δ_(c+1) S'(i'), then in the new loop S(j) δ_c S'(j').
- If a loop at some level p is not involved in any loop-carried dependences, then it may be moved inward to any other level.
43 Example
Version on left:
    DO J = 1, 100
      DO I = 1, 100
        DO K = 1, 100
          C(I,J) = C(I,J) + A(I,K) * B(K,J)
        END DO
      END DO
    END DO
Version on right:
    DO K = 1, 100
      DO I = 1, 100
        DO J = 1, 100
          C(I,J) = C(I,J) + A(I,K) * B(K,J)
        END DO
      END DO
    END DO
- The version on the left has a true dependence in the innermost loop.
- In the version on the right, loop level 1 has been interchanged with loop level 3. This is legal because the level 1 loop has no dependences.
44 Example
Here, there is a dependence at the innermost level
    DO I = 1, L
      DO J = 1, M
        C(I,J) = 0.0
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K) * B(K,J)
        END DO
      END DO
    END DO
Dependence graph: two true dependences (δt).
There are many ways to vectorize matrix multiplication
45 Example
    DO I = 1, L
      DO J = 1, M
        C(I,J) = 0.0
      END DO
    END DO
    DO K = 1, N
      DO I = 1, L
        DO J = 1, M
          C(I,J) = C(I,J) + A(I,K) * B(K,J)
        END DO
      END DO
    END DO
Apply loop distribution and then loop interchange to move the dependence to the outermost level
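A short sketch confirming that distribution followed by interchange leaves the result unchanged. Matrices are lists of rows, and the function names are made up for the illustration.

```python
def matmul_original(a, b):
    """I, J, K nest with the initialization inside the J loop."""
    l, n, m = len(a), len(b), len(b[0])
    c = [[None] * m for _ in range(l)]
    for i in range(l):
        for j in range(m):
            c[i][j] = 0.0
            for k in range(n):
                c[i][j] = c[i][j] + a[i][k] * b[k][j]
    return c


def matmul_distributed_interchanged(a, b):
    """Initialization nest first, then the update nest with K outermost."""
    l, n, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(l)]
    for k in range(n):               # the dependence is now carried by the outer loop
        for i in range(l):
            for j in range(m):
                c[i][j] = c[i][j] + a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
before = matmul_original(a, b)
after = matmul_distributed_interchanged(a, b)
```

Each element of C still accumulates its products in increasing K order, so the two versions give identical values.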
46 Example
    C(1:L,1:M) = 0.0
    DO K = 1, N
      C(1:L,1:M) = C(1:L,1:M) + A(1:L,K) * B(K,1:M)
    END DO
- We now have vector code, although this may not be the most suitable form for many machines.
- It may be more appropriate to vectorize in one dimension only and to apply strip mining to it.
47 Example
    DO J = 1, M
      C(1:L,J) = 0.0
      DO K = 1, N
        C(1:L,J) = C(1:L,J) + A(1:L,K) * B(K,J)
      END DO
    END DO
- is vectorized in a single dimension
48 Example
    DO JS = 1, M, 32
      DO J = JS, MIN(JS+31, M)
        C(1:L,J) = 0.0
        DO K = 1, N
          C(1:L,J) = C(1:L,J) + A(1:L,K) * B(K,J)
        END DO
      END DO
    END DO
- breaks up the work into vectors of length 32
49 Example
Here, the problem is that the updates use a temporary scalar, T. We cannot apply scalar forward substitution.
    DO I = 1, L
      DO J = 1, M
        T = 0.0
        DO K = 1, N
          T = T + A(I,K) * B(K,J)
        END DO
        C(I,J) = T
      END DO
    END DO
50 Scalar Expansion
Scalar expansion generates an array that replaces the scalar T
    DO I = 1, L
      DO J = 1, M
        T(I,J) = 0.0
        DO K = 1, N
          T(I,J) = T(I,J) + A(I,K) * B(K,J)
        END DO
        C(I,J) = T(I,J)
      END DO
    END DO
Unfortunately, this is very wasteful of memory!
51 Strip Mining
    DO I = 1, L
      DO JS = 1, M, 32
        DO J = JS, MIN(JS+31, M)
          T(I,J) = 0.0
          DO K = 1, N
            T(I,J) = T(I,J) + A(I,K) * B(K,J)
          END DO
          C(I,J) = T(I,J)
        END DO
      END DO
    END DO
Strip mining creates vectors with length equal to that of the vector registers (here, we use 32)
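The strip bounds can be sketched as follows. The helper name `strips` is hypothetical, and the clamp on the upper bound is an addition that handles trip counts that are not a multiple of the strip length.

```python
def strips(m, width=32):
    """Split the 1-based range 1..m into strips of at most `width` iterations."""
    return [(start, min(start + width - 1, m)) for start in range(1, m + 1, width)]


bounds = strips(70)
# Every iteration 1..70 is covered exactly once, in order.
covered = [j for lo, hi in bounds for j in range(lo, hi + 1)]
```

For a trip count of 70 this gives two full strips of 32 iterations and a final strip of 6.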
52 Software Pipelining
- Recall: this overlaps execution of multiple iterations of a loop nest
  - To exploit multiple hardware resources concurrently
- Requires data dependence testing to determine legality
- Some transformations may improve results
  - Loop unrolling may increase the workload in an iteration to enable a more efficient schedule
  - For large loops, loop distribution may reduce the amount of data used in an iteration to avoid register spills