Allen and Kennedy, Chapter 13

Description:

Optimizing Compilers for Modern Architectures. Allen and Kennedy, Chapter 13. Fortran 90. PowerPoint PPT presentation

Slides: 41

Provided by: ans115

Transcript and Presenter's Notes

1
Compiling Array Assignments

Allen and Kennedy, Chapter 13

2
Fortran 90

Fortran 90: successor to Fortran 77
Slow to gain acceptance
Needs better, smarter compiler techniques to achieve the same level of performance as Fortran 77 compilers
This chapter focuses on a single new feature, the array assignment statement:
  A(1:100) = 2.0
Intended to provide a direct mechanism to specify parallel/vector execution
This statement must be implemented for the specific available hardware. On a uniprocessor, the statement must be converted to a scalar loop: scalarization

3
Fortran 90

Range of a vector operation in Fortran 90 is denoted by a triplet <lower bound : upper bound : increment>
  A(1:100:2) = B(2:51) * 3.0
Semantics of Fortran 90 require that, for vector statements, all inputs to the statement are fetched before any results are stored

4
Outline

Simple scalarization
Safe scalarization
Techniques to improve on safe scalarization
Loop reversal
Input prefetching
Loop splitting
Multidimensional scalarization
A framework for analyzing multidimensional
scalarization

5
Scalarization

Replace each array assignment by a corresponding
DO loop
Is it really that easy?
Two key issues:
Wish to avoid generating large array temporaries
Wish to optimize loops to exhibit good memory hierarchy performance

6
Simple Scalarization

Consider the vector statement
  A(1:200) = 2.0 * A(1:200)
A scalar implementation:
  S1  DO I = 1, 200
  S2    A(I) = 2.0 * A(I)
      ENDDO
However, some statements cause problems:
  A(2:201) = 2.0 * A(1:200)
If we naively scalarize:
  DO i = 1, 200
    A(i+1) = 2.0 * A(i)
  ENDDO

7
Simple Scalarization

Scalarization faults: the meaning of a statement is changed by the scalarization process
Naive algorithm, which ignores such scalarization faults:
  procedure SimpleScalarize(S)
    let V0 = (L0 : U0 : D0) be the vector iteration specifier on the left side of S
    // Generate the scalarizing loop
    let I be a new loop index variable
    generate the statement DO I = L0, U0, D0
    for each vector specifier V = (L : U : D) in S do
      replace V with (I + L - L0)
    generate an ENDDO statement
  end SimpleScalarize

8
Scalarization Faults

Why do scalarization faults occur?
Vector operation semantics: all values from the RHS of the assignment should be fetched before storing into the result
If a scalar operation stores into a location fetched by a later operation, we get a scalarization fault
Principle 13.1: A vector assignment generates a scalarization fault if and only if the scalarized loop carries a true dependence.
These dependences are known as scalarization dependences
To preserve correctness, the compiler should never produce a scalarization dependence
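
The fault can be reproduced with a small simulation. The following is a minimal Python sketch, assuming list slices stand in for Fortran 90 array sections; the function names `vector_assign` and `naive_scalarize` are illustrative and do not come from the chapter. It runs A(2:201) = 2.0 * A(1:200) both ways and shows that the naive loop reads a value stored one iteration earlier.

```python
def vector_assign(a):
    """Fortran 90 semantics: fetch every input, then store every result."""
    out = list(a)
    rhs = [2.0 * a[i] for i in range(1, 201)]  # all of A(1:200) is fetched first
    out[2:202] = rhs                           # only then is A(2:201) stored
    return out


def naive_scalarize(a):
    """Naive scalar loop: DO i = 1, 200 ; A(i+1) = 2.0 * A(i) ; ENDDO."""
    out = list(a)
    for i in range(1, 201):
        out[i + 1] = 2.0 * out[i]  # reads the value stored by iteration i-1
    return out


# Index 0 is padding so that list indices match the Fortran subscripts 1..201.
a = [0.0] + [1.0] * 201
expected = vector_assign(a)
actual = naive_scalarize(a)
print(expected[2], expected[3])  # 2.0 2.0
print(actual[2], actual[3])      # 2.0 4.0 (loop-carried true dependence)
```

The two results already differ at A(3), which is the first element whose input was overwritten by an earlier iteration.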

9
Safe Scalarization

Naive algorithm for safe scalarization: use temporary storage to make sure scalarization dependences are not created
Consider
  A(2:201) = 2.0 * A(1:200)
This can be split up into
  T(1:200) = 2.0 * A(1:200)
  A(2:201) = T(1:200)
Then scalarize using SimpleScalarize:
  DO I = 1, 200
    T(I) = 2.0 * A(I)
  ENDDO
  DO I = 2, 201
    A(I) = T(I-1)
  ENDDO

10
Safe Scalarization

Procedure SafeScalarize implements this method of scalarization
Good news:
Scalarization is always possible by using temporaries
Bad news:
Substantial increase in memory use due to temporaries
More memory operations per array element
We shall look at a number of techniques to reduce the effects of these disadvantages

11
Loop Reversal

  A(2:256) = A(1:255) + 1.0
SimpleScalarize will produce a scalarization fault
Solution: loop reversal
  DO I = 256, 2, -1
    A(I) = A(I-1) + 1.0
  ENDDO
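
As a quick check of the reversal idea, here is a minimal Python sketch (the function names are illustrative assumptions, and lists with a padding element at index 0 stand in for the Fortran array). It compares the forward and reversed scalar loops for A(2:256) = A(1:255) + 1.0 against the fetch-before-store result.

```python
def vector_semantics(a):
    """A(2:256) = A(1:255) + 1.0 with all inputs fetched before any store."""
    out = list(a)
    out[2:257] = [a[i] + 1.0 for i in range(1, 256)]
    return out


def forward_loop(a):
    """DO I = 2, 256: each iteration reads the element stored just before it."""
    out = list(a)
    for i in range(2, 257):
        out[i] = out[i - 1] + 1.0
    return out


def reversed_loop(a):
    """DO I = 256, 2, -1: A(I-1) is always read before it is overwritten."""
    out = list(a)
    for i in range(256, 1, -1):
        out[i] = out[i - 1] + 1.0
    return out


# Index 0 is padding; every element gets a distinct value so a fault is visible.
a = [0.0] + [float(k * k) for k in range(1, 257)]
print(reversed_loop(a) == vector_semantics(a))  # True
print(forward_loop(a) == vector_semantics(a))   # False
```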

12
Loop Reversal

When can we use loop reversal?
Loop reversal maps true dependences into antidependences
But it also maps antidependences into true dependences
  A(2:257) = ( A(1:256) + A(3:258) ) / 2.0
After scalarization:
  DO I = 2, 257
    A(I) = ( A(I-1) + A(I+1) ) / 2.0
  ENDDO
Loop reversal gets us
  DO I = 257, 2, -1
    A(I) = ( A(I-1) + A(I+1) ) / 2.0
  ENDDO
Thus, we cannot use loop reversal in the presence of antidependences

13
Input Prefetching

  A(2:257) = ( A(1:256) + A(3:258) ) / 2.0
Causes a scalarization fault when naively scalarized to
  DO I = 2, 257
    A(I) = ( A(I-1) + A(I+1) ) / 2.0
  ENDDO
Problem: the first input, A(I-1), was stored into by the previous iteration
Input prefetching: use scalar temporaries to hold elements of the input and output arrays

14
Input Prefetching

A first cut at using temporaries:
  DO I = 2, 257
    T1 = A(I-1)
    T2 = ( T1 + A(I+1) ) / 2.0
    A(I) = T2
  ENDDO
T1 holds an element of the input array, T2 holds an element of the output array
But this faces the same problem. We can correct it by moving the assignment to T1 into the previous iteration...

15
Input Prefetching

  T1 = A(1)
  DO I = 2, 256
    T2 = ( T1 + A(I+1) ) / 2.0
    T1 = A(I)
    A(I) = T2
  ENDDO
  T2 = ( T1 + A(258) ) / 2.0
  A(257) = T2
Note: we are using scalar replacement, but the motivation for doing so is different from that in Chapter 8
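
The prefetched loop can be checked against the vector semantics with a short simulation. This is a minimal Python sketch, assuming the statement being scalarized is A(2:257) = ( A(1:256) + A(3:258) ) / 2.0; the function names are illustrative, and the peeled final iteration reads A(258) because I+1 = 258 when I = 257.

```python
def vector_semantics(a):
    """All inputs A(1:256) and A(3:258) are fetched before A(2:257) is stored."""
    out = list(a)
    out[2:258] = [(a[i - 1] + a[i + 1]) / 2.0 for i in range(2, 258)]
    return out


def prefetched(a):
    """Scalar loop with input prefetching for a threshold-1 dependence."""
    out = list(a)
    t1 = out[1]                      # T1 = A(1)
    for i in range(2, 257):          # DO I = 2, 256
        t2 = (t1 + out[i + 1]) / 2.0
        t1 = out[i]                  # save A(I) before it is overwritten
        out[i] = t2
    t2 = (t1 + out[258]) / 2.0       # peeled last iteration, I = 257
    out[257] = t2
    return out


# Index 0 is padding so that list indices match Fortran subscripts 1..258.
a = [0.0] + [float((k * k) % 17) for k in range(1, 259)]
print(prefetched(a) == vector_semantics(a))  # True
```

Peeling the last iteration only saves the final, unused load of T1; keeping it inside the loop would give the same result.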

16
Input Prefetching

As already seen in Chapter 8, we need as many temporaries as the dependence threshold + 1.
Example:
  DO I = 2, 257
    A(I+2) = A(I) + 1.0
  ENDDO

Can be changed to
  T1 = A(2)
  T2 = A(3)
  DO I = 2, 255
    T3 = T1 + 1.0
    T1 = T2
    T2 = A(I+2)
    A(I+2) = T3
  ENDDO
  T3 = T1 + 1.0
  T1 = T2
  A(258) = T3
  T3 = T1 + 1.0
  A(259) = T3
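
A simulation helps confirm the bookkeeping for a threshold of 2. The following is a minimal Python sketch, assuming the loop is the scalarization of A(4:259) = A(2:257) + 1.0, so the two prefetch temporaries start with the first two inputs, A(2) and A(3); the function names are illustrative.

```python
def vector_semantics(a):
    """A(4:259) = A(2:257) + 1.0 with all inputs fetched before any store."""
    out = list(a)
    out[4:260] = [a[i] + 1.0 for i in range(2, 258)]
    return out


def prefetched(a):
    """Threshold 2: two temporaries hold pending inputs, a third the result."""
    out = list(a)
    t1, t2 = out[2], out[3]      # inputs of the first two iterations
    for i in range(2, 256):      # DO I = 2, 255
        t3 = t1 + 1.0
        t1 = t2
        t2 = out[i + 2]          # fetch A(I+2) before it is overwritten
        out[i + 2] = t3
    t3 = t1 + 1.0                # peeled iteration I = 256
    t1 = t2
    out[258] = t3
    t3 = t1 + 1.0                # peeled iteration I = 257
    out[259] = t3
    return out


# Index 0 is padding so that list indices match Fortran subscripts 1..259.
a = [0.0] + [float(k * 3 % 11) for k in range(1, 260)]
print(prefetched(a) == vector_semantics(a))  # True
```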

17
Input Prefetching

Can also unroll the loop and eliminate register-to-register copies
Principle 13.2: Any scalarization dependence with a threshold known at compile time can be corrected by input prefetching.

18
Input Prefetching

Sometimes, even when a scalarization dependence does not have a constant threshold, input prefetching can be used effectively
  A(1:N) = A(1:N) / A(1)
which can be naively scalarized as
  DO i = 1, N
    A(i) = A(i) / A(1)
  ENDDO
True dependence from the first iteration to every other iteration
Antidependence from the first iteration to itself
Via input prefetching, we get
  tA1 = A(1)
  DO i = 1, N
    A(i) = A(i) / tA1
  ENDDO

19
Loop Splitting

Problem with using input prefetching with thresholds > 1: temporaries must be saved for each iteration of the loop up to the threshold
Can potentially use loop splitting to solve this problem
  A(3:6) = ( A(1:4) + A(5:8) ) / 2.0
Scalarization loop:
  DO I = 3, 6
    A(I) = ( A(I-2) + A(I+2) ) / 2.0
  ENDDO
True dependence and antidependence, each with a threshold of 2
Split up into two independent loops which do not interact with each other...

20
Loop Splitting

  DO I = 3, 6
    A(I) = ( A(I-2) + A(I+2) ) / 2.0
  ENDDO
Both the true dependence and the antidependence have a threshold of 2
With loop splitting this becomes
  DO I = 3, 5, 2
    A(I) = ( A(I-2) + A(I+2) ) / 2.0
  ENDDO
  DO I = 4, 6, 2
    A(I) = ( A(I-2) + A(I+2) ) / 2.0
  ENDDO
Note that the threshold becomes 1 for each loop.
Splitting could have produced incorrect results if the threshold of the antidependence were not divisible by 2

21
Loop Splitting

Can write the splitting as a nested pair of loops:
  DO i1 = 3, 4
    DO i2 = i1, 6, 2
      A(i2) = ( A(i2-2) + A(i2+2) ) / 2.0
    ENDDO
  ENDDO
The inner loop carries a scalarization dependence with threshold 1 and the outer loop carries no dependence. Can apply input prefetching:
  DO i1 = 3, 4
    T1 = A(i1-2)
    DO i2 = i1, 6, 2
      T2 = ( T1 + A(i2+2) ) / 2.0
      T1 = A(i2)
      A(i2) = T2
    ENDDO
  ENDDO
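
The split-and-prefetch form is small enough to check by hand or by simulation. Below is a minimal Python sketch, assuming the statement A(3:6) = ( A(1:4) + A(5:8) ) / 2.0 and using illustrative function names and sample data.

```python
def vector_semantics(a):
    """A(3:6) = (A(1:4) + A(5:8)) / 2.0 with all inputs fetched first."""
    out = list(a)
    out[3:7] = [(a[i - 2] + a[i + 2]) / 2.0 for i in range(3, 7)]
    return out


def split_and_prefetch(a):
    """Outer loop picks the odd or even chain; inner loop prefetches with T1."""
    out = list(a)
    for i1 in (3, 4):                   # DO i1 = 3, 4
        t1 = out[i1 - 2]                # T1 = A(i1-2)
        for i2 in range(i1, 7, 2):      # DO i2 = i1, 6, 2
            t2 = (t1 + out[i2 + 2]) / 2.0
            t1 = out[i2]                # keep the old A(i2) for the next step
            out[i2] = t2
    return out


# Index 0 is padding so that list indices match Fortran subscripts 1..8.
a = [0.0, 1.0, 5.0, 2.0, 8.0, 3.0, 13.0, 4.0, 21.0]
print(split_and_prefetch(a)[3:7])  # [2.0, 9.0, 3.0, 14.5]
print(vector_semantics(a)[3:7])    # [2.0, 9.0, 3.0, 14.5]
```

Only one prefetch temporary is live per chain, whereas prefetching the unsplit loop would need one per iteration up to the threshold.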

22
Loop Splitting

Principle 13.3: Any scalarization loop in which all true dependences have the same constant threshold T and all antidependences have a threshold that is divisible by T can be transformed, using input prefetching and loop splitting, so that all scalarization dependences are eliminated.

23
Scalarization Algorithm

Revised scalarization algorithm:
  procedure FullScalarize
    for each vector statement S do begin
      compute the dependences of S on itself as though S had been scalarized
      if S has no scalarization dependences upon itself
        then SimpleScalarize(S)
      else if S has scalarization dependences, but no self antidependences
        then begin
          SimpleScalarize(S)
          reverse the scalarization loop
        end

24
Scalarization Algorithm

      else if all scalarization dependences have a threshold of 1
        then begin
          SimpleScalarize(S)
          InputPrefetch(S)
        end
      else if all scalarization dependences for S have the same constant threshold T
          and all antidependences have thresholds that are divisible by T
        then SplitLoop(S)
      else if all antidependences for S have the same constant threshold T
          and all true dependences have thresholds that are divisible by T
        then begin
          reverse the loop
          SplitLoop(S)
        end
      else SafeScalarize(S, SL)
    end
  end FullScalarize
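
The decision cascade in FullScalarize can be sketched as a small function. This is a minimal Python sketch under stated assumptions: dependence analysis has already been done, every threshold is a compile-time constant, "scalarization dependences" are taken to mean the true dependences of the scalarized loop, and the function name `choose_strategy` and its return strings are illustrative rather than taken from the chapter.

```python
def choose_strategy(true_thresholds, anti_thresholds):
    """Pick a scalarization strategy from self-dependence thresholds.

    true_thresholds: thresholds of loop-carried true dependences.
    anti_thresholds: thresholds of loop-carried antidependences.
    """
    true_t = set(true_thresholds)
    anti_t = set(anti_thresholds)
    if not true_t:
        return "simple"                        # no scalarization dependence
    if not anti_t:
        return "simple + loop reversal"        # reversal creates no new fault
    if true_t == {1}:
        return "simple + input prefetching"
    if len(true_t) == 1:
        t = next(iter(true_t))
        if all(a % t == 0 for a in anti_t):
            return "loop splitting"
    if len(anti_t) == 1:
        t = next(iter(anti_t))
        if all(d % t == 0 for d in true_t):
            return "loop reversal + loop splitting"
    return "safe scalarization"                # fall back to an array temporary


print(choose_strategy([1], []))    # A(2:256) = A(1:255) + 1.0
print(choose_strategy([1], [1]))   # A(2:257) = (A(1:256) + A(3:258)) / 2.0
print(choose_strategy([2], [2]))   # A(3:6) = (A(1:4) + A(5:8)) / 2.0
```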

25
Multidimensional Scalarization

Vector statements in Fortran 90 can have more than one dimension
  A(1:100, 1:100) = B(1:100, 1, 1:100)
corresponds to
  DO J = 1, 100
    A(1:100, J) = B(1:100, 1, J)
  ENDDO
Scalarization in multiple dimensions:
  A(1:100, 1:100) = 2.0 * A(1:100, 1:100)
Obvious strategy: convert each vector iterator into a loop
  DO J = 1, 100, 1
    DO I = 1, 100
      A(I,J) = 2.0 * A(I,J)
    ENDDO
  ENDDO

26
Multidimensional Scalarization

What should the order of the loops be after scalarization?
Familiar question: we dealt with this issue in Loop Selection/Interchange in Chapter 5
Profitability of a particular configuration depends on the target architecture
For simplicity, we shall assume shorter strides through memory are better
Thus, the optimal choice for the innermost loop is the leftmost vector iterator

27
Multidimensional Scalarization

Extending previous results to multiple dimensions
Each vector iterator is scalarized separately, with the leftmost vector iterator as the innermost loop and the remaining iterators taken from left to right
Once the ordering is available:
1. Test to see if the loop carries a scalarization dependence. If not, then proceed to the next loop.
2. If the scalarization loop carries only true dependences, reverse the loop and proceed to the next loop.
3. Apply input prefetching, with loop splitting where appropriate, to eliminate dependences to which it applies. Observe, however, that in outer loops, prefetching is done for a single submatrix (the remaining dimensions).
4. Otherwise, the loop carries a scalarization fault that requires temporary storage. Generate a scalarization that utilizes temporary storage and terminate the scalarization test for this loop, since temporary storage will eliminate all scalarization faults.

28
Outer Loop Prefetching

  A(1:N, 1:N) = ( A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1) ) / 2.0
If we try to scalarize this (keeping the column iterator in the innermost loop) we get a true scalarization dependence (<, >) involving the second input and an antidependence (>, <) involving the first input
Cannot use loop reversal...

29
Outer Loop Prefetching

  A(1:N, 1:N) = ( A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1) ) / 2.0
We can use input prefetching on the outer loop. The temporaries will be arrays:
  T0(1:N) = A(2:N+1, 0)
  DO j = 1, N-1
    T1(1:N) = ( A(0:N-1, j+1) + T0(1:N) ) / 2.0
    T0(1:N) = A(2:N+1, j)
    A(1:N, j) = T1(1:N)
  ENDDO
  T1(1:N) = ( A(0:N-1, N+1) + T0(1:N) ) / 2.0
  A(1:N, N) = T1(1:N)
Total temporary space required: two vectors of length N (two columns of the original matrix)
Better than the storage required for a copy of the result matrix
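
Outer-loop prefetching can be simulated in the same spirit as the one-dimensional case. This is a minimal Python sketch, assuming nested lists indexed as a[i][j] stand in for A(i, j) with subscripts 0..N+1, and that the peeled last step reads column N+1 (j+1 with j = N); the function names and sample data are illustrative.

```python
def vector_semantics(a, n):
    """A(1:N,1:N) = (A(0:N-1,2:N+1) + A(2:N+1,0:N-1)) / 2.0, fetch before store."""
    out = [row[:] for row in a]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            out[i][j] = (a[i - 1][j + 1] + a[i + 1][j - 1]) / 2.0
    return out


def outer_prefetch(a, n):
    """Column loop with array temporaries T0 (saved input) and T1 (result)."""
    out = [row[:] for row in a]
    rows = range(1, n + 1)
    t0 = [out[i + 1][0] for i in rows]             # T0(1:N) = A(2:N+1, 0)
    for j in range(1, n):                          # DO j = 1, N-1
        t1 = [(out[i - 1][j + 1] + t0[i - 1]) / 2.0 for i in rows]
        t0 = [out[i + 1][j] for i in rows]         # save column j before the store
        for i in rows:
            out[i][j] = t1[i - 1]                  # A(1:N, j) = T1(1:N)
    t1 = [(out[i - 1][n + 1] + t0[i - 1]) / 2.0 for i in rows]
    for i in rows:
        out[i][n] = t1[i - 1]                      # peeled step, j = N
    return out


n = 4
a = [[float((3 * i + 5 * j) % 7) for j in range(n + 2)] for i in range(n + 2)]
print(outer_prefetch(a, n) == vector_semantics(a, n))  # True
```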

30
Loop Interchange

Sometimes, there is a tradeoff between scalarization and optimal memory hierarchy usage
  A(2:100, 3:101) = A(3:101, 1:197:2)
If we scalarize this using the prescribed order:
  DO I = 3, 101
    DO J = 2, 100
      A(J,I) = A(J+1, 2*I-5)
    ENDDO
  ENDDO
Dependences: (<, >) (I = 3, 4) and (>, >) (I = 6, 7)
Cannot use loop reversal or input prefetching
Can use temporaries
31
Loop Interchange

However, we can use loop interchange to get
  DO J = 2, 100
    DO I = 3, 101
      A(J,I) = A(J+1, 2*I-5)
    ENDDO
  ENDDO
Not optimal memory hierarchy usage, but a reduction of temporary storage
Loop interchange is useful to reduce the size of temporaries
It can also eliminate scalarization dependences
32
General Multidimensional Scalarization

Goal: to scalarize a single statement which has m vector dimensions
Given an ideal order of scalarization (l1, l2, ..., lm)
Let (d1, d2, ..., dn) be the direction vectors for all true dependences and antidependences of the statement upon itself
The scalarization matrix is an n × m matrix of these direction vectors
For instance,
  A(1:N, 1:N, 1:N) = A(0:N-1, 1:N, 2:N+1) + A(1:N, 2:N+1, 0:N-1)
has the scalarization matrix
  >  =  <
  <  >  =

33
General Multidimensional Scalarization

If we examine any column of the direction matrix, we can immediately see if the corresponding loop can be safely scalarized as the outermost loop of the nest
If all entries of the column are = or >, it can be safely scalarized as the outermost loop without loop reversal.
If all entries are = or <, it can be safely scalarized with loop reversal.
If it contains a mixture of < and >, it cannot be scalarized by simple means.

34
General Multidimensional Scalarization

Once a loop has been selected for scalarization, the dependences carried by that loop (any dependence whose direction vector does not contain an = in the position corresponding to the selected loop) may be eliminated from further consideration.
In our example, if we move the second column to the outside, we get
  >  =  <            =  >  <
  <  >  =   becomes  >  <  =
Scalarization in this way will reduce the matrix to
  >  <

35
A Complete Scalarization Algorithm

procedure CompleteScalarize(S, loop_list)
  let M be the scalarization direction matrix resulting from scalarizing S to loop_list
  while there are more loops to be scalarized do begin
    let l be the first loop in loop_list that can be simply scalarized with or without loop reversal (determined by examining the columns of M from left to right)
    if there is no such l then begin
      let l be the first loop on loop_list
      section l by input prefetching
      if the previous step fails then
        section S using the naive temporary method and exit
    end

36
A Complete Scalarization Algorithm

    else // make l the outermost loop
      section l directly or with loop reversal, depending on the entries in the column of M corresponding to l (if l is the last loop, use hardware section length)
    remove l from loop_list
    let M' be M with the column corresponding to l and the rows corresponding to non-= entries in that column eliminated
    M = M'
  end // while
end CompleteScalarize
Time complexity: O(m^2 n), where
m = number of loops
n = number of dependences
37
A Complete Scalarization Algorithm

Correctness follows from the definition of the scalarization matrix and the method used to remove dependences from the matrix
For a given statement S and loop list, CompleteScalarize produces a correct scalarization with the following properties:
(1) input prefetching is applied to the innermost loop possible
(2) the order of scalarization loops is the closest possible to the order specified on input among scalarizations with property (1).

38
Scalarization Example

  DO J = 2, N-1
    A(2:N-1,J) = ( A(1:N-2,J) + A(3:N,J) + A(2:N-1,J-1) + A(2:N-1,J+1) ) / 4.
  ENDDO
Loop-carried true dependence and antidependence
A naive compiler could generate
  DO J = 2, N-1
    DO i = 2, N-1
      T(i-1) = ( A(i-1,J) + A(i+1,J) + A(i,J-1) + A(i,J+1) ) / 4
    ENDDO
    DO i = 2, N-1
      A(i,J) = T(i-1)
    ENDDO
  ENDDO
2 × (N-2)^2 accesses to memory due to array T

39
Scalarization Example

However, we can use input prefetching to get
  DO J = 2, N-1
    tA0 = A(1, J)
    DO i = 2, N-2
      tA1 = ( tA0 + A(i+1,J) + A(i,J-1) + A(i,J+1) ) / 4
      tA0 = A(i, J)
      A(i,J) = tA1
    ENDDO
    tA1 = ( tA0 + A(N,J) + A(N-1,J-1) + A(N-1,J+1) ) / 4
    A(N-1,J) = tA1
  ENDDO
If the temporaries are allocated to registers, there are no more memory accesses than in the original Fortran 90 program
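
The prefetched stencil can be checked against the array-temporary version by simulation. This is a minimal Python sketch, assuming nested lists indexed as a[i][j] stand in for A(i, J) with subscripts 1..N, and assuming the prefetch temporary must capture A(i, J) before that element is overwritten; the function names and sample data are illustrative.

```python
def reference(a, n):
    """Per-column vector statement: compute the whole new column, then store."""
    out = [row[:] for row in a]
    for j in range(2, n):                       # DO J = 2, N-1
        new = [(out[i - 1][j] + out[i + 1][j] + out[i][j - 1] + out[i][j + 1]) / 4
               for i in range(2, n)]
        for i in range(2, n):
            out[i][j] = new[i - 2]
    return out


def prefetched(a, n):
    """Same computation with scalar temporaries instead of the array T."""
    out = [row[:] for row in a]
    for j in range(2, n):                       # DO J = 2, N-1
        ta0 = out[1][j]                         # tA0 = A(1, J)
        for i in range(2, n - 1):               # DO i = 2, N-2
            ta1 = (ta0 + out[i + 1][j] + out[i][j - 1] + out[i][j + 1]) / 4
            ta0 = out[i][j]                     # old A(i, J), needed at i + 1
            out[i][j] = ta1
        ta1 = (ta0 + out[n][j] + out[n - 1][j - 1] + out[n - 1][j + 1]) / 4
        out[n - 1][j] = ta1                     # peeled iteration, i = N-1
    return out


n = 6
# Index 0 is padding in both dimensions; valid subscripts are 1..N.
a = [[float((i * i + 3 * j) % 11) for j in range(n + 1)] for i in range(n + 1)]
print(prefetched(a, n) == reference(a, n))  # True
```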

40
Post Scalarization Issues

Issues due to scalarization:
Generates many individual loops
These loops carry no dependences, so reuse of quantities in registers is not common
Solution: use loop interchange, loop fusion, unroll-and-jam, and scalar replacement
