Loop Transformations - PowerPoint PPT Presentation




Slides: 53

Provided by: barbara179


Transcript and Presenter's Notes


1
Loop Transformations

Vectorization
Statement Reordering
Loop distribution

2
Strategies for Optimization

Loop transformations can be used systematically in a variety of different ways to achieve a desired program form.
This particular strategy was developed to generate code for vector supercomputers.
It can also be used for vector co-processors and, with few modifications, for SIMD co-processors.

3
The -xvector option

Transforms loops that apply an intrinsic function to each element into a single call to a faster vectorized equivalent, e.g.
for (i = 0; i < N; i++)
    x[i] = exp(y[i]);
Syntax: -xvector[={yes|no}]
These vector functions are part of libmvec (see man libmvec).
The option must also be given on the link line.

4
Vectorization
Targets vector supercomputers, vector co-processors and, with few modifications, SIMD co-processors.

Goals:
generate array syntax
convert Fortran 77 to Fortran 90
create vector code for vector supercomputers or vector co-processors
Approach:
convert as many statements in program loops as possible into vector form

5
Vector Code

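A(2:101) = B(1:100) + A(1:100)
The right-hand side is evaluated completely before any element of the left-hand side is assigned.
The above is not the same as
DO I = 2, 101
  A(I) = B(I-1) + A(I-1)
END DO

True dependence! Each iteration reads the A(I-1) written by the previous iteration.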
6
Vector Code
A(2:101) = B(1:100) + A(1:100)

is the same as
DO I = 2, 101
  TEMP(I) = B(I-1) + A(I-1)
END DO
DO I = 2, 101
  A(I) = TEMP(I)
END DO

Neither loop carries a dependence, but the first loop must finish before the second one starts.
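A small simulation makes the difference between array-assignment semantics and DO-loop semantics concrete. This is a sketch, not compiler output: arrays are modelled as Python lists indexed like A(1:101) (element 0 is unused padding), the `+` operator follows the statements above, and the function names are hypothetical.

```python
# Sketch: Fortran array-assignment semantics vs. sequential DO-loop semantics.

def loop_version(a, b):
    """DO I = 2, 101: A(I) = B(I-1) + A(I-1), one iteration at a time."""
    a = list(a)
    for i in range(2, 102):
        a[i] = b[i - 1] + a[i - 1]  # reads the A(I-1) written by the previous iteration
    return a


def array_version(a, b):
    """A(2:101) = B(1:100) + A(1:100): evaluate the whole right-hand side first."""
    a = list(a)
    rhs = [b[i] + a[i] for i in range(1, 101)]  # only old values of A are read
    a[2:102] = rhs                               # then all 100 elements are stored
    return a


def temp_version(a, b):
    """The two-loop form through TEMP, equivalent to the array assignment."""
    a = list(a)
    temp = [0.0] * len(a)
    for i in range(2, 102):
        temp[i] = b[i - 1] + a[i - 1]
    for i in range(2, 102):
        a[i] = temp[i]
    return a


a0 = [1.0] * 102
b0 = [float(i) for i in range(102)]
print(loop_version(a0, b0)[3], array_version(a0, b0)[3])  # 4.0 3.0
```

The loop version already disagrees at A(3), because it reads the freshly computed A(2); the TEMP version always agrees with the array assignment.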
7
Vectorization

The transformation generates a vector statement to replace a statement within a loop nest.
It is applied to individual statements, so we must first generate a series of loop nests in which each loop body consists of a single statement.
Vectorization is legal if it preserves all dependences of the original code.

8
Example
DO I = 1, N
  A(I) = B(I) + C(I)
  D(I) = A(I-1) + D(I)
END DO

is equivalent to

DO I = 1, N
  A(I) = B(I) + C(I)
END DO
DO I = 1, N
  D(I) = A(I-1) + D(I)
END DO

The first loop becomes A(1:N) = B(1:N) + C(1:N)
and the second becomes D(1:N) = A(0:N-1) + D(1:N)
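The equivalence can be checked numerically. The sketch below assumes the statements A(I) = B(I) + C(I) and D(I) = A(I-1) + D(I), uses lists indexed from 0 so that A(0) exists, and uses hypothetical function names; it is an illustration, not a proof.

```python
# Sketch: loop distribution followed by vectorization for
#   DO I = 1, N
#     A(I) = B(I) + C(I)
#     D(I) = A(I-1) + D(I)
#   END DO

def original(a, b, c, d, n):
    a, d = list(a), list(d)
    for i in range(1, n + 1):
        a[i] = b[i] + c[i]        # S1
        d[i] = a[i - 1] + d[i]    # S2 uses A(I-1), defined by S1 one iteration earlier
    return a, d


def distributed_vector(a, b, c, d, n):
    a, d = list(a), list(d)
    # A(1:N) = B(1:N) + C(1:N)
    a[1:n + 1] = [b[i] + c[i] for i in range(1, n + 1)]
    # D(1:N) = A(0:N-1) + D(1:N), executed after all of A has been computed
    d[1:n + 1] = [a[i - 1] + d[i] for i in range(1, n + 1)]
    return a, d


n = 5
a = [10, 0, 0, 0, 0, 0]        # a[0] plays the role of A(0)
b = [0, 1, 2, 3, 4, 5]
c = [0, 10, 20, 30, 40, 50]
d = [0, 1, 1, 1, 1, 1]
print(original(a, b, c, d, n) == distributed_vector(a, b, c, d, n))  # True
```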
9
Strategy

First, loop distribution is used to replace a loop nest by multiple loops with a single statement each,
and then vector code generation is used to vectorize each statement in both loops.
This works as long as no dependence cycle involves these statements.

10
In General

If there is a dependence whose source is a statement that comes textually after the sink statement, we first reorder the statements in the loop.
The vector statement defining new values must be executed before the vector statement using them.

11
Statement Reordering Example
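Antidependence: the second statement reads A(I+1) before the first statement overwrites it in the next iteration, so the source of the dependence comes textually after its sink.
DO I = 1, N
  A(I) = B(I) + C(I)
  D(I) = A(I+1) + D(I)
END DO

is equivalent to

DO I = 1, N
  D(I) = A(I+1) + D(I)
  A(I) = B(I) + C(I)
END DO

which can be distributed into

DO I = 1, N
  D(I) = A(I+1) + D(I)
END DO
DO I = 1, N
  A(I) = B(I) + C(I)
END DO

The first loop becomes D(1:N) = A(2:N+1) + D(1:N)
and the second becomes A(1:N) = B(1:N) + C(1:N)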
12
Statement Reordering

The statement reordering transformation has many uses.
Synopsis: Exchange the position of two adjacent statements in the body of a loop.
Legality test: Valid iff there is no loop-independent dependence between the pair of statements.

13
Statement Reordering
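bef: appears in the source text before

Exchanges the text position of two adjacent statements in a loop.
Given a loop L containing adjacent statements S and S', with S bef S' and such that $S \; \delta_\infty \; S'$ does not hold, it creates a loop L' with S and S' swapped.
Suppose $S_1 \; \delta \; S_2$ for statements $S_1$ and $S_2$. Then there are iteration vectors $i_1$ and $i_2$ such that $S_1(i_1) \ll_k S_2(i_2)$, i.e. $S_1(i_1)$ executes before $S_2(i_2)$ and the dependence has level k, and the former defines a value used by the latter.
If $k = \infty$ (both instances belong to the same iteration), swapping the statements would change the meaning of the loop. So this is what we must test for.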

14
Loop Distribution

Used to isolate some statements in a loop nest from other statements.
The simple form of loop distribution attempts to transform a loop nest with multiple assignment statements in its body into a series of loop nests, each containing a single assignment statement.
The statements need not all be enclosed in exactly the same loop nest.

15
Loop Distribution Example
DO K = 1, M
  A(K) = 0.
  DO J = 1, N
    A(K) = A(K) + B(J,K)
  END DO
END DO
becomes
DO K = 1, M
  A(K) = 0.
END DO
DO K = 1, M
  DO J = 1, N
    A(K) = A(K) + B(J,K)
  END DO
END DO
16
Validity of Loop Distribution

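Let S, S' be statements in a loop L.
A dependence $S \; \delta \; S'$ is backward iff S' bef S.
We call all other dependences forward.
A backward dependence is loop-carried, since its sink S' must execute in a later iteration than its source S.
Distribution of L would destroy it: the loop containing S' would run to completion before the loop containing S starts.
Therefore loop distribution may be applied to a loop L iff all of its dependences are forward.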
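The legality condition is easy to state as code. In this sketch the representation is an assumption, not any particular compiler's data structure: a dependence is a `(source, sink)` pair of statement names, and the loop body is the list of statements in textual order.

```python
# Sketch: a dependence S -> S' is backward iff S' appears textually before S;
# simple loop distribution is legal iff no dependence is backward.

def is_backward(dep, body):
    source, sink = dep
    position = {stmt: k for k, stmt in enumerate(body)}
    return position[sink] < position[source]


def distribution_legal(body, deps):
    return not any(is_backward(dep, body) for dep in deps)


# S1: A(I) = B(I) + C(I-1);  S2: C(I) = B(I)
# S2 defines C(I) and S1 uses it in the next iteration.
deps = [("S2", "S1")]
print(distribution_legal(["S1", "S2"], deps))  # False: the dependence is backward
print(distribution_legal(["S2", "S1"], deps))  # True after reordering the statements
```

A self-dependence (source equal to sink) is never backward under this definition, which matches the rule that single-statement cycles do not block distribution.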

17
Illegal Loop Distribution Example
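DO I = 1, N
S1  A(I) = B(I) + C(I-1)
S2  C(I) = B(I)
END DO
is not equivalent to
DO I = 1, N
S1  A(I) = B(I) + C(I-1)
END DO
DO I = 1, N
S2  C(I) = B(I)
END DO
because of the backward dependence $S_2 \; \delta \; S_1$: S1 reads C(I-1), which S2 wrote in the previous iteration, but after distribution S1 sees only the old values of C.
Of course, had we applied statement reordering first, the dependence would have become forward and the distribution would have been legal.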
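A numeric check makes the failure visible. The sketch assumes S1 is A(I) = B(I) + C(I-1) and S2 is C(I) = B(I); lists are indexed from 0 so that C(0) exists, and the function names are hypothetical.

```python
# Sketch: why distributing a loop with a backward dependence is illegal.
#   S1: A(I) = B(I) + C(I-1)
#   S2: C(I) = B(I)

def original(b, c, n):
    a, c = [0] * (n + 1), list(c)
    for i in range(1, n + 1):
        a[i] = b[i] + c[i - 1]   # S1 reads the C(I-1) written by S2 one iteration earlier
        c[i] = b[i]              # S2
    return a, c


def distributed(b, c, n):
    a, c = [0] * (n + 1), list(c)
    for i in range(1, n + 1):    # all S1 instances first: they see only old values of C
        a[i] = b[i] + c[i - 1]
    for i in range(1, n + 1):
        c[i] = b[i]
    return a, c


def reordered_then_distributed(b, c, n):
    a, c = [0] * (n + 1), list(c)
    for i in range(1, n + 1):    # S2 moved first: the dependence is now forward
        c[i] = b[i]
    for i in range(1, n + 1):
        a[i] = b[i] + c[i - 1]
    return a, c


b = [0, 1, 2, 3, 4]
c = [5, 5, 5, 5, 5]
print(original(b, c, 4) == distributed(b, c, 4))                 # False
print(original(b, c, 4) == reordered_then_distributed(b, c, 4))  # True
```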
18
Loop Distribution

Let L be a loop and L' be the code obtained by a legal distribution of loop L.
Then L' preserves the type of all dependences in L, but not their level.
If $S \; \delta_\infty \; S'$ in L, then $S \; \delta_\infty \; S'$ in L'.
If $S \; \delta_c \; S$ in L, then $S \; \delta_c \; S$ in L'.
If $S \; \delta_c \; S'$ in L, where $S \neq S'$, then $S \; \delta_\infty \; S'$ in L'.

19
Loop Distribution

A loop L can be transformed into an equivalent loop L' with no backward dependences
by a sequence of valid statement reordering transformations
iff its dependence graph is
either acyclic or
contains only single-statement cycles.
If vectorization is the objective, loop distribution and vector code generation can immediately follow.

20
Limits of Loop Distribution

If there is a non-trivial cycle in the dependence graph,
e.g. $S \; \delta \; S'$ and $S' \; \delta \; S$,
then any legal statement reordering merely exchanges the roles of the two dependences: one of them is always backward.
So reordering cannot eliminate the backward edges, and loop distribution cannot be applied to the resulting loop.

21
Loop Distribution

Assume loop L has an acyclic dependence graph.
1. Number the nodes of the graph $S_1, \ldots, S_n$ such that if $S_j \; \delta \; S_k$, then $j < k$.
Then, if S bef $S_1$, $S \; \delta \; S_1$ does not hold.
2. So $S_1$ can be moved to the first position in the loop by legal statement reorderings, and so on for the remaining statements.
3. Now assume the first k statements in the loop are $S_1, \ldots, S_k$. Let S be the statement immediately before $S_{k+1}$.
Since $S \; \delta \; S_{k+1}$ cannot hold, $S_{k+1}$ can be swapped with S; repeating this moves $S_{k+1}$ to follow $S_k$.
The resulting loop has only forward edges.
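The ordering argument is essentially a topological sort of the dependence graph. The sketch below computes the target statement order directly (Kahn's algorithm, breaking ties by textual position) rather than performing the individual adjacent swaps; the `(source, sink)` representation and the function name are assumptions.

```python
import heapq
from collections import defaultdict


def reorder_acyclic(body, deps):
    """Return an order of `body` in which every dependence (source, sink) is forward.

    Self-dependences (single-statement cycles) do not constrain the order.
    Raises ValueError if the graph has a multi-statement cycle.
    """
    position = {stmt: k for k, stmt in enumerate(body)}
    indegree = {stmt: 0 for stmt in body}
    successors = defaultdict(list)
    for source, sink in deps:
        if source != sink:
            successors[source].append(sink)
            indegree[sink] += 1
    ready = [position[s] for s in body if indegree[s] == 0]
    heapq.heapify(ready)                 # prefer statements that come first textually
    order = []
    while ready:
        stmt = body[heapq.heappop(ready)]
        order.append(stmt)
        for nxt in successors[stmt]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, position[nxt])
    if len(order) != len(body):
        raise ValueError("dependence graph contains a multi-statement cycle")
    return order


# Statement reordering example: D(I) = A(I+1) + D(I) must precede A(I) = B(I) + C(I)
print(reorder_acyclic(["S1", "S2"], [("S2", "S1")]))  # ['S2', 'S1']
```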

22
Loop Fission

Name sometimes given to a more general form of loop distribution.
Some loops are only partially distributable.
The general transformation separates distributable statements from non-distributable code.
The former can then be vectorized.

23
Loop Fission Example

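DO I = 1, N
  DO J = 1, M
S1  A(I,J) = ...
S2  A(I,J+1) = A(I,J)
  END DO
S3 B(I) = X + Y(I)
S4 D(I) = B(I) + 1
END DO

Dependences: S1 and S2 form a cycle ($S_1 \; \delta^t \; S_2$ and $S_2 \; \delta^o \; S_1$); $S_3 \; \delta^t \; S_4$.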
24
Loop Fission Example

DO I = 1, N
  DO J = 1, M
S1  A(I,J) = ...
S2  A(I,J+1) = A(I,J)
  END DO
END DO
The J loop is not vectorizable: S1 and S2 form a dependence cycle.

DO I = 1, N
S3 B(I) = X + Y(I)
END DO
DO I = 1, N
S4 D(I) = B(I) + 1
END DO
These loops are vectorizable.
25
Loop Fission Example

DO I = 1, N
  DO J = 1, M
S1  A(I,J) = ...
S2  A(I,J+1) = A(I,J)
  END DO
END DO
S3 B(1:N) = X + Y(1:N)
S4 D(1:N) = B(1:N) + 1
26
Vector Code Generation

Replaces a single-statement loop that is not a recurrence by its vector (array) form:
DO I = L, U
  A(I) = B(I) + C(I+1)
END DO
becomes
A(L:U) = B(L:U) + C(L+1:U+1)

27
General Vector Code Generation

1. Find the cycles in the dependence graph D.
2. Create the acyclic condensation AC(D), a new graph in which each cycle (strongly connected component) is represented by a single node. Order its nodes.
3. Perform a sequence of statement reordering transformations so that all statements in a cycle are adjacent, and the order of statements in different cycles corresponds to their order in AC(D).
4. Apply loop distribution to this loop, giving one loop per node of AC(D); loops whose node contains no cycle can then be vectorized.
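The four steps can be sketched for small loop bodies. This version is an assumption-laden illustration, not a production algorithm: it finds cycles as strongly connected components by mutual reachability (quadratic, but simple) instead of Tarjan's algorithm, orders the condensation topologically with ties broken by textual position, and labels each resulting loop "vector" or "loop". The dependences used below are those stated for the example loop that follows ($S_1 \delta S_2$, $S_2 \delta S_1$, $S_3 \delta S_3$, $S_3 \delta S_4$, $S_4 \delta S_1$).

```python
def reachable(graph, start):
    """Nodes reachable from `start` by a path of length >= 1."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def vector_code_plan(body, deps):
    """Return [(statements, 'vector' | 'loop')] in the order given by AC(D)."""
    graph = {s: set() for s in body}
    for source, sink in deps:
        graph[source].add(sink)
    reach = {s: reachable(graph, s) for s in body}
    position = {s: k for k, s in enumerate(body)}

    # Step 1: cycles = strongly connected components (mutual reachability).
    comp_of, comps = {}, []
    for s in body:
        if s not in comp_of:
            comp = frozenset([s] + [t for t in reach[s] if s in reach[t]])
            comps.append(comp)
            for t in comp:
                comp_of[t] = comp

    # Step 2: edges of the acyclic condensation AC(D).
    succ = {comp: set() for comp in comps}
    indegree = {comp: 0 for comp in comps}
    for source, sink in deps:
        a, b = comp_of[source], comp_of[sink]
        if a != b and b not in succ[a]:
            succ[a].add(b)
            indegree[b] += 1

    # Steps 3-4: topological order of AC(D); one loop per node after distribution.
    def first_pos(comp):
        return min(position[s] for s in comp)

    ready = sorted((comp for comp in comps if indegree[comp] == 0), key=first_pos)
    plan = []
    while ready:
        comp = ready.pop(0)
        stmts = sorted(comp, key=position.get)
        cyclic = any(s in reach[s] for s in stmts)   # recurrence or dependence cycle
        plan.append((stmts, "loop" if cyclic else "vector"))
        for nxt in succ[comp]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
        ready.sort(key=first_pos)
    return plan


deps = [("S1", "S2"), ("S2", "S1"), ("S3", "S3"), ("S3", "S4"), ("S4", "S1")]
for stmts, kind in vector_code_plan(["S1", "S2", "S3", "S4"], deps):
    print(stmts, kind)
# ['S3'] loop
# ['S4'] vector
# ['S1', 'S2'] loop
```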

28
Example
DO I = L, U
S1 A(I) = B(I) + C(I+1)
S2 B(I+1) = A(I-1) + D(I)
S3 E(I) = E(I-1) + 1
S4 C(I+2) = E(I) + F(I)
END DO
S1 and S2 are in a dependence cycle.
S3 has a single-statement cycle.
$S_3 \; \delta^t \; S_4$ and $S_4 \; \delta^t \; S_1$.
29
Example

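Ordering the nodes of the acyclic condensation ({S3}, then {S4}, then {S1, S2}) and distributing the loop accordingly gives
DO I = L, U
S3 E(I) = E(I-1) + 1
END DO
DO I = L, U
S4 C(I+2) = E(I) + F(I)
END DO
DO I = L, U
S1 A(I) = B(I) + C(I+1)
S2 B(I+1) = A(I-1) + D(I)
END DO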
30
Example

Only the second loop is vectorizable: the first loop is a recurrence and the third contains a dependence cycle.
DO I = L, U
S3 E(I) = E(I-1) + 1
END DO
S4 C(L+2:U+2) = E(L:U) + F(L:U)
DO I = L, U
S1 A(I) = B(I) + C(I+1)
S2 B(I+1) = A(I-1) + D(I)
END DO
31
Vector Code Algorithm

This does not suffice to deal well with many loop nests encountered in practice.
Fortunately, we can often improve upon this result.

We can sometimes transform a loop nest further so that the vector code transformation becomes applicable.
We can exploit opportunities for vectorization of some levels of a loop nest.
We may also apply additional dependence-breaking transformations.

32
Scalar Expansion
Example: in a nest DO K = 1, N / DO I = 1, N whose body uses the scalar T together with A(K), every reference to T is replaced by a reference to T(I).

This transformation creates a copy of a scalar variable for each iteration of the loop nest in which it occurs, by replacing the variable with an appropriately dimensioned array.
This transformation may eliminate dependences while preserving the semantics of the code.
It is particularly useful in vectorization and shared-memory parallelization.

33
Scalar Expansion

Replaces a scalar variable by an array of the same size as a loop's iteration space.
Input: a perfect nest L of n loops with scalar A on the left-hand side of an assignment in L.
A is not a formal parameter, an induction variable, a single-statement reduction, or an element in a common block.
The kth loop in L has lower bound Lk and upper bound Uk.
Output: a modified loop in which all references to A are replaced by references A'(I1, ..., In) to an n-dimensional array A'(L1:U1, ..., Ln:Un), followed by an assignment to A immediately after the loop nest:
A = A'(U1, U2, ..., Un)
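A minimal sketch of scalar expansion for a one-deep nest, assuming a hypothetical loop DO I = 1, N / T = A(I) + B(I) / C(I) = T * T / END DO (this loop is not from the slides). Lists are indexed like Fortran arrays with element 0 unused; the final assignment T = T'(U1) restores the scalar's last value.

```python
# Sketch: scalar expansion of T in the (hypothetical) loop
#   DO I = 1, N
#     T = A(I) + B(I)
#     C(I) = T * T
#   END DO

def original(a, b, n):
    c = [0.0] * (n + 1)
    t = 0.0
    for i in range(1, n + 1):
        t = a[i] + b[i]      # every iteration overwrites the same scalar T
        c[i] = t * t
    return c, t


def expanded(a, b, n):
    # T'(1:N) gives each iteration its own copy, removing the anti- and output
    # dependences on T, so both statements can be distributed and vectorized.
    tp = [0.0] * (n + 1)
    tp[1:n + 1] = [a[i] + b[i] for i in range(1, n + 1)]    # T'(1:N) = A(1:N) + B(1:N)
    c = [0.0] * (n + 1)
    c[1:n + 1] = [tp[i] * tp[i] for i in range(1, n + 1)]   # C(1:N) = T'(1:N) * T'(1:N)
    t = tp[n]                                                # T = T'(U1) after the loop
    return c, t


a = [0.0, 1.0, 2.0, 3.0]
b = [0.0, 0.5, 0.5, 0.5]
print(original(a, b, 3) == expanded(a, b, 3))  # True
```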

34
Scalar Expansion

If the first occurrence of A in L is a use, we modify the above as follows:
the lower bound in the last dimension is Ln - 1 instead;
an assignment A'(L1, ..., Ln - 1) = A is inserted immediately before the loop;
all uses of A before the first definition in L are replaced by A'(I1, I2, ..., In - 1).
All other occurrences are replaced as before.
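The use-before-definition case can be sketched the same way, assuming a hypothetical loop DO I = 1, N / B(I) = T / T = A(I) / END DO (not from the slides). Here n = 1, so the array is T'(0:N), T'(0) = T is inserted before the loop, and uses read T'(I - 1). After expansion the two statements can be reordered, distributed and vectorized.

```python
# Sketch: scalar expansion when the first occurrence of T is a use, for the
# (hypothetical) loop
#   DO I = 1, N
#     B(I) = T        ! use of T before its definition in this iteration
#     T = A(I)
#   END DO

def original(a, t, n):
    b = [0] * (n + 1)
    for i in range(1, n + 1):
        b[i] = t
        t = a[i]
    return b, t


def expanded(a, t, n):
    tp = [0] * (n + 1)          # T'(0:N): lower bound lowered from 1 to 0
    tp[0] = t                   # T'(0) = T, inserted before the loop
    tp[1:n + 1] = a[1:n + 1]    # T'(1:N) = A(1:N)
    b = [0] * (n + 1)
    b[1:n + 1] = tp[0:n]        # B(1:N) = T'(0:N-1): each use sees the previous copy
    return b, tp[n]             # T = T'(N) after the loop


a = [0, 1, 2, 3, 4]
print(original(a, 9, 4))   # ([0, 9, 1, 2, 3], 4)
print(expanded(a, 9, 4))   # ([0, 9, 1, 2, 3], 4)
```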

35
Idiom Recognition
Maybe we can handle this after all:
DO I = L, U
  E(I) = E(I-1) + 1
END DO

Recognize frequent (small) computations that can be tuned to the target architecture, and perform specific optimizations.
Examples
dot product,
max and min values of array
maxloc and minloc of array
scalar product
matrix multiplication
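As one illustration of idiom recognition, the recurrence E(I) = E(I-1) + 1 has the closed form E(I) = E(L-1) + (I - L + 1), which carries no dependence and can be computed in vector form. The sketch below only checks that the two agree; the function names are hypothetical, and this is just one way a compiler might treat the idiom.

```python
def recurrence(e, lo, hi):
    """DO I = LO, HI: E(I) = E(I-1) + 1, executed sequentially."""
    e = list(e)
    for i in range(lo, hi + 1):
        e[i] = e[i - 1] + 1
    return e


def closed_form(e, lo, hi):
    """Same result without a recurrence: E(I) = E(LO-1) + (I - LO + 1)."""
    e = list(e)
    base = e[lo - 1]
    e[lo:hi + 1] = [base + (i - lo + 1) for i in range(lo, hi + 1)]
    return e


e0 = [7, 0, 0, 0, 0, 0]
print(recurrence(e0, 1, 5))   # [7, 8, 9, 10, 11, 12]
print(closed_form(e0, 1, 5))  # [7, 8, 9, 10, 11, 12]
```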

36
Loop Interchange

Powerful transformation to exchange order of two
loops
Reorders execution of statement instances, so
only legal if this does not change semantics of
program
Has many uses, e.g.
Move vectorizable loops to innermost position of
loop
Move parallelizable loops to outermost position
to increase granularity and decrease
synchronization

37
Loop Interchange

Rearranges the execution order of statement instances in a loop nest.
Goals: increase cache reuse, move dependence cycles.
Legality: valid if data dependences are preserved.
Input: a (perfectly) nested loop L of depth n > 1 and an interchange level c < n.
Output: the loop nest L' obtained from L by exchanging the order of the DO loop headers for $L_c$ and $L_{c+1}$.
Generalizations exist for imperfectly nested loops and for pairs of non-adjacent loops.

38
Loop Interchange Example

DO J = 1, M
  DO I = 1, N
    A(I,J+1) = A(I+1,J) + B
  END DO
END DO

Can the loops be interchanged into DO I = 1, N / DO J = 1, M?
In the original loop, A(2,2) is defined in iteration (J,I) = (1,2) and used in iteration (2,1).
39
Loop Interchange Example
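Loop interchange is not legal for this loop:
DO J = 1, M
  DO I = 1, N
    A(I,J+1) = A(I+1,J) + B
  END DO
END DO
In the loop order (J,I), the dependence has direction vector (<,>). Interchanging the loops exchanges the entries to (>,<): the use would precede the definition, turning the true dependence into an antidependence ($\delta^a$).

With DO I = 1, N / DO J = 1, M, A(2,2) is defined in iteration (I,J) = (2,1) but used earlier, in iteration (1,2), so that use reads the wrong (old) value.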
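A direct simulation of both loop orders confirms the problem. The sketch assumes the statement A(I,J+1) = A(I+1,J) + B with B = 1000 and an initial array A(i,j) = 10*i + j; these values and the function name `run` are made up for illustration.

```python
def run(order, n, m, b=1000):
    """Execute A(I,J+1) = A(I+1,J) + B over I = 1..N, J = 1..M in the given loop order.

    order == "JI": DO J outer, DO I inner (original)
    order == "IJ": DO I outer, DO J inner (interchanged)
    A(i, j) starts as 10*i + j; row and column 0 are unused padding.
    """
    a = [[10 * i + j for j in range(m + 2)] for i in range(n + 2)]
    if order == "JI":
        iterations = [(i, j) for j in range(1, m + 1) for i in range(1, n + 1)]
    else:
        iterations = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)]
    for i, j in iterations:
        a[i][j + 1] = a[i + 1][j] + b
    return a


original = run("JI", 2, 2)
interchanged = run("IJ", 2, 2)
# A(1,3) = A(2,2) + B reads A(2,2), which is written in iteration I=2, J=1
print(original[1][3], interchanged[1][3])  # 2031 1022
```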
40
Loop Interchange

The direction vector records, for each loop level, the direction of a dependence between the iterations involved.
After interchanging a pair of adjacent loops, the new direction vector is formed by exchanging the corresponding entries in the direction vector.
If loop interchange reverses the execution order of the references involved in a dependence, then it changes the semantics of the program.
A dependence with direction vector (<,>) would become (>,<) if the loops are swapped, and this does reverse the access pattern!

41
Loop Interchange

So we must test whether there are direction vectors of the form (<,>) involving the loop levels that we want to exchange.
If not, we can apply the transformation.
If there are such direction vectors, the transformation is illegal and cannot be applied.
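The test can be sketched over direction vectors encoded as tuples of '<', '=' and '>' (an assumed encoding; '*' entries are not handled). The sketch swaps the entries for levels c and c+1 and rejects the interchange if any swapped vector becomes lexicographically negative, i.e. its leftmost non-'=' entry is '>'. For dependences not carried by an outer loop this is exactly the (<,>) pattern; the function names are hypothetical.

```python
def swap_levels(dv, c):
    """Exchange the entries for loop levels c and c+1 (levels are 1-based)."""
    dv = list(dv)
    dv[c - 1], dv[c] = dv[c], dv[c - 1]
    return tuple(dv)


def lexicographically_positive(dv):
    """True if the leftmost non-'=' entry is '<' (or all entries are '=')."""
    for d in dv:
        if d == "<":
            return True
        if d == ">":
            return False
    return True  # loop-independent dependence: unaffected by interchange


def interchange_legal(direction_vectors, c):
    return all(lexicographically_positive(swap_levels(dv, c)) for dv in direction_vectors)


print(interchange_legal([("<", ">")], 1))       # False: the A(I,J+1) example
print(interchange_legal([("=", "<", "=")], 1))  # True
print(interchange_legal([("<", "<", ">")], 2))  # True: carried by the outer loop
```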

42
Legal Loop Interchange

The transformation exchanging loops c and c+1 does not modify any loop dependences at levels other than c and c+1.
If dir(i, i') is $(=^{c-1}, <, =, \ldots)$, then in the modified loop it becomes $(=^{c}, <, \ldots)$. If dir(i, i') is $(=^{c-1}, <, <, \ldots)$, it remains unchanged.
Thus if $S(i) \; \delta_k \; S'(i')$ for any $1 \le k < c$ or $c+1 < k \le n$, or if $k = \infty$, then $S(i) \; \delta_k \; S'(i')$ in the modified loop.
If $S(i) \; \delta_{c+1} \; S'(i')$, then in the new loop $S(j) \; \delta_c \; S'(j')$, where j and j' are obtained from i and i' by exchanging their entries c and c+1.

If a loop at some level p is not involved in any loop-carried dependences, then it may be moved inward to any other level.
43
Example

DO J = 1, 100
  DO K = 1, 100
    DO I = 1, 100
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
    END DO
  END DO
END DO

DO I = 1, 100
  DO K = 1, 100
    DO J = 1, 100
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
    END DO
  END DO
END DO
The first version has a true dependence carried by the K loop (direction vector (=,<,=)).
In the second version, loop level 1 has been interchanged with loop level 3. This is legal because the level 1 loop does not carry any dependence.

44
Example
Here, there is a dependence at the innermost level:

DO I = 1, L
  DO J = 1, M
    C(I,J) = 0.0
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
    END DO
  END DO
END DO

There are many ways to vectorize matrix multiplication.
45
Example

DO I = 1, L
  DO J = 1, M
    C(I,J) = 0.0
  END DO
END DO
DO K = 1, N
  DO I = 1, L
    DO J = 1, M
      C(I,J) = C(I,J) + A(I,K) * B(K,J)
    END DO
  END DO
END DO

Apply loop distribution and then loop interchange to move the dependence to the outermost level.
46
Example

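C(1:L,1:M) = 0.0
DO K = 1, N
  C(1:L,1:M) = C(1:L,1:M) + A(1:L,K) * B(K,1:M)
END DO
Here A(1:L,K) * B(K,1:M) denotes the outer product of the two vectors (standard Fortran 90 would express it with SPREAD).
We now have vector code, although this may not be the most suitable form for many machines.
It may be more appropriate to vectorize in one dimension only and to apply strip mining to it.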

47
Example

DO J = 1, M
  C(1:L, J) = 0.0
  DO K = 1, N
    C(1:L, J) = C(1:L, J) + A(1:L,K) * B(K,J)
  END DO
END DO
is vectorized in a single dimension.

48
Example

DO JJ = 1, M, 32
  DO J = JJ, MIN(JJ+31, M)
    C(1:L, J) = 0.0
    DO K = 1, N
      C(1:L, J) = C(1:L, J) + A(1:L,K) * B(K,J)
    END DO
  END DO
END DO
breaks up the work into strips of (at most) 32 iterations of J.
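The loop bounds of a strip-mined loop can be sketched as a generator. The MIN in the inner bound handles an M that is not a multiple of the strip length; the function name `strips` is hypothetical.

```python
def strips(lo, hi, length=32):
    """Bounds of DO JJ = LO, HI, LENGTH / DO J = JJ, MIN(JJ + LENGTH - 1, HI)."""
    for jj in range(lo, hi + 1, length):
        yield jj, min(jj + length - 1, hi)


print(list(strips(1, 70)))  # [(1, 32), (33, 64), (65, 70)]

# Every J in 1..M is visited exactly once, in the original order.
visited = [j for first, last in strips(1, 70) for j in range(first, last + 1)]
print(visited == list(range(1, 71)))  # True
```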

49
Example
Here, the problem is that the updates use a temporary scalar, T. We cannot apply scalar forward substitution.

DO I = 1, L
  DO J = 1, M
    T = 0.0
    DO K = 1, N
      T = T + A(I,K) * B(K,J)
    END DO
    C(I,J) = T
  END DO
END DO

50
Scalar Expansion
Scalar expansion generates an array that replaces the scalar T:

DO I = 1, L
  DO J = 1, M
    T(I,J) = 0.0
    DO K = 1, N
      T(I,J) = T(I,J) + A(I,K) * B(K,J)
    END DO
    C(I,J) = T(I,J)
  END DO
END DO

Unfortunately, this is very wasteful of memory: T now needs L*M elements!
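Scalar expansion is what allows the K loop to be moved outward: once every (I,J) pair has its own accumulator T(I,J), the loops can be distributed and interchanged as in the earlier matrix multiplication slides. The sketch uses 0-based Python lists (the slides use 1-based Fortran arrays), and the function names are hypothetical.

```python
def matmul_scalar_t(a, b, l, m, n):
    """Original form: one scalar accumulator T reused for every (I, J)."""
    c = [[0.0] * m for _ in range(l)]
    for i in range(l):
        for j in range(m):
            t = 0.0
            for k in range(n):
                t = t + a[i][k] * b[k][j]
            c[i][j] = t
    return c


def matmul_expanded_t(a, b, l, m, n):
    """After scalar expansion: T(I,J) is an L x M array, so K can become outermost."""
    t = [[0.0] * m for _ in range(l)]      # T(1:L,1:M) = 0.0
    for k in range(n):                      # the dependence is now carried only by K
        for i in range(l):                  # the I and J loops carry no dependence
            for j in range(m):              # and could be executed in vector form
                t[i][j] = t[i][j] + a[i][k] * b[k][j]
    return [row[:] for row in t]            # C(1:L,1:M) = T(1:L,1:M)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_scalar_t(a, b, 2, 2, 2))    # [[19.0, 22.0], [43.0, 50.0]]
print(matmul_expanded_t(a, b, 2, 2, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```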
51
Strip Mining

DO I = 1, L
  DO JJ = 1, M, 32
    DO J = JJ, MIN(JJ+31, M)
      T(I,J) = 0.0
      DO K = 1, N
        T(I,J) = T(I,J) + A(I,K) * B(K,J)
      END DO
      C(I,J) = T(I,J)
    END DO
  END DO
END DO

Strip mining creates vectors with length equal to that of the vector registers (here, we use 32).
52
Software Pipelining

Recall that this overlaps the execution of multiple iterations of a loop nest,
to exploit multiple hardware resources concurrently.
It requires data dependence testing to determine legality.
Some transformations may improve results:
loop unrolling may increase the workload in an iteration to enable a more efficient schedule;
for large loops, loop distribution may reduce the amount of data used in an iteration to avoid register spills.
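Loop unrolling, mentioned above as an aid to software pipelining, can be sketched in Python. This only shows the shape of the transformation (unroll by 4 plus a remainder loop); any scheduling benefit depends on the compiler and hardware, which Python does not model. The SAXPY-style loop and the function names are assumptions.

```python
def saxpy(alpha, x, y):
    """Reference loop: Y(I) = ALPHA * X(I) + Y(I)."""
    for i in range(len(x)):
        y[i] = alpha * x[i] + y[i]


def saxpy_unrolled4(alpha, x, y):
    """Same loop unrolled by 4: more independent work per iteration for the scheduler."""
    n = len(x)
    main = n - n % 4
    for i in range(0, main, 4):
        y[i] = alpha * x[i] + y[i]
        y[i + 1] = alpha * x[i + 1] + y[i + 1]
        y[i + 2] = alpha * x[i + 2] + y[i + 2]
        y[i + 3] = alpha * x[i + 3] + y[i + 3]
    for i in range(main, n):   # remainder loop for N not divisible by 4
        y[i] = alpha * x[i] + y[i]


x = list(range(10))
y1, y2 = [1] * 10, [1] * 10
saxpy(2, x, y1)
saxpy_unrolled4(2, x, y2)
print(y1 == y2)  # True
```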