Title: Loop optimizations and parallelizing compilers
1. Loop optimizations and parallelizing compilers
2. Why loops?
- 90% of execution time is spent in 10% of the code, mostly in loops.
- Loops are relatively easy to analyze.
- Loop optimizations: different ways to transform loops while preserving their semantics.
- Objective?
  - Single-threaded systems: mostly optimizing for the memory hierarchy.
  - Multi-threaded systems: loop parallelization.
- A parallelizing compiler automatically finds the loops that can be executed in parallel.
3. Loop optimization: scalar replacement of array elements

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c(i, j) = c(i, j) + a(i, k) * b(k, j);

Registers are almost never allocated to array elements. Why? Scalar replacement allows a register to be allocated to the scalar, which reduces memory references. Also known as register pipelining.

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      ct = c(i, j);
      for (k = 0; k < N; k++)
        ct = ct + a(i, k) * b(k, j);
      c(i, j) = ct;
    }
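A minimal compilable C sketch of the scalar replacement above; the value of N and the test-data initialization are choices made for this illustration, not from the slides:

  #include <stdio.h>
  #define N 4

  int main(void) {
      double a[N][N], b[N][N], c[N][N];
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              a[i][j] = i + j;   /* arbitrary test data */
              b[i][j] = i - j;
              c[i][j] = 0.0;
          }
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              double ct = c[i][j];          /* scalar replaces c[i][j] */
              for (int k = 0; k < N; k++)
                  ct += a[i][k] * b[k][j];  /* ct can stay in a register */
              c[i][j] = ct;                 /* one store after the k loop */
          }
      printf("c[0][0] = %f\n", c[0][0]);
      return 0;
  }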
4. Loop normalization

  for (i = a; i < b; i += c)
    ...

  for (ii = 1; ii < ???; ii++) {
    i = a + (ii - 1) * c;
    ...
  }

Loop normalization does not do much by itself, but it makes the iteration space much easier to manipulate, which enables other optimizations.
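The "???" bound is left elided as in the slides. A minimal C sketch of one way to normalize, assuming the trip count is computed as (b - a + c - 1) / c (that formula is an assumption for this example):

  #include <stdio.h>

  int main(void) {
      int a = 2, b = 11, c = 3;
      /* Original loop: for (i = a; i < b; i += c) visits i = 2, 5, 8. */
      for (int ii = 1; ii <= (b - a + c - 1) / c; ii++) {
          int i = a + (ii - 1) * c;  /* recover the original index */
          printf("i = %d\n", i);     /* prints 2, 5, 8 */
      }
      return 0;
  }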
5. Loop transformations
- Change the shape of loop iterations.
- Change the access pattern:
  - Increase data reuse (locality).
  - Reduce overheads.
- Valid transformations need to maintain the dependences:
  - If (i1, i2, ..., in) depends on (j1, j2, ..., jn), then (j1, j2, ..., jn) needs to happen before (i1, i2, ..., in) in a valid transformation.
6. Loop transformations
- Unimodular transformations: loop interchange, loop permutation, loop reversal, loop skewing, and many others.
- Loop fusion and distribution.
- Loop tiling.
- Loop unrolling.
7. Unimodular transformations
- A unimodular matrix is a square matrix with all integral components and a determinant of 1 or -1.
- Let the unimodular matrix be U; it transforms iteration i = (i1, i2, ..., in) to iteration U·i.
- Applicability (proven by Michael Wolf):
  - A unimodular transformation represented by matrix U is legal when applied to a loop nest with a set of distance vectors D if and only if, for each d in D, U·d > 0 (lexicographically).
8. Unimodular transformations example: loop interchange

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i, j) = a(i-1, j) + 1;

  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      a(i, j) = a(i-1, j) + 1;

Why is this transformation valid?
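One worked answer, using the legality condition from slide 7: the statement a(i, j) = a(i-1, j) + 1 carries a dependence with distance vector d = (1, 0). Loop interchange corresponds to the unimodular matrix U = [0 1; 1 0] (determinant -1), and U·d = (0, 1), which is lexicographically positive, so the interchange is legal.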
9. Unimodular transformations example: loop permutation

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        for (l = 0; l < n; l++)
          ...
10. Unimodular transformations example: loop reversal

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i, j) = a(i-1, j) + 1.0;

  for (i = 0; i < n; i++)
    for (j = n-1; j >= 0; j--)
      a(i, j) = a(i-1, j) + 1.0;
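Why reversing the inner loop is legal here, again using the condition from slide 7: the distance vector is d = (1, 0), inner-loop reversal is U = [1 0; 0 -1], and U·d = (1, 0), which is still lexicographically positive. Reversing the outer loop instead (U = [-1 0; 0 1]) would give U·d = (-1, 0), which is not lexicographically positive, so that reversal would be illegal.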
11. Unimodular transformations example: loop skewing

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i) = a(i+j) + 1.0;

  for (i = 0; i < n; i++)
    for (j = i; j < i+n; j++)
      a(i) = a(j) + 1.0;
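In matrix form this skew is U = [1 0; 1 1], which maps iteration (i, j) to (i, i+j). Skewing is always legal: for any distance vector d = (d1, d2), U·d = (d1, d1+d2) remains lexicographically positive whenever d is.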
12. Loop fusion
- Takes two adjacent loops that have the same iteration space and combines their bodies.
- Legal when fusing does not violate any flow, anti-, or output dependence between the two loops (a counter-example is sketched after this slide's code).
- Why?
  - Increases the loop body, reducing loop overheads.
  - Increases the opportunity for instruction scheduling.
  - May improve locality.

  for (i = 0; i < n; i++) a(i) = 1.0;
  for (j = 0; j < n; j++) b(j) = 1.0;

  for (i = 0; i < n; i++) { a(i) = 1.0; b(i) = 1.0; }
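A minimal C counter-example for the legality condition above; the arrays, sizes, and the read of a[j+1] are invented for this sketch:

  #include <stdio.h>
  #define N 5

  int main(void) {
      double a[N + 1], b[N];
      for (int i = 0; i <= N; i++) a[i] = 0.0;

      /* In the separate loops, every a[i] is written before any read. */
      for (int i = 0; i < N; i++) a[i] = 1.0;
      for (int j = 0; j < N; j++) b[j] = a[j + 1];

      /* Fusing these loops would be illegal: iteration i would read
         a[i+1] before iteration i+1 writes it, violating the flow
         dependence. Here b[0] is 1.0; a fused version would get 0.0. */
      printf("b[0] = %f\n", b[0]);
      return 0;
  }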
13. Loop distribution
- Takes one loop and partitions it into two loops.
- Legal when no dependence cycle is broken.
- Why?
  - Reduces the memory trace.
  - Improves locality.
  - Increases the opportunity for instruction scheduling.

  for (i = 0; i < n; i++) { a(i) = 1.0; b(i) = a(i); }

  for (i = 0; i < n; i++) a(i) = 1.0;
  for (j = 0; j < n; j++) b(j) = a(j);
14. Loop tiling
- Replaces a single loop with two loops:
  for (i = 0; i < n; i++)  →  for (i = 0; i < n; i += t) for (ii = i; ii < min(i+t, n); ii++)
- t is called the tile size.
- An n-deep nest can be changed into an (n+1)-deep to 2n-deep nest.

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        ...

  for (i = 0; i < n; i += it)
    for (ii = i; ii < min(i+it, n); ii++)
      for (j = 0; j < n; j += jt)
        for (jj = j; jj < min(j+jt, n); jj++)
          for (k = 0; k < n; k += kt)
            for (kk = k; kk < min(k+kt, n); kk++)
              ...
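A minimal runnable C sketch of the one-dimensional tiling rewrite above; n and t are arbitrary example values:

  #include <stdio.h>

  static int min(int x, int y) { return x < y ? x : y; }

  int main(void) {
      int n = 10, t = 4;  /* t is the tile size */
      /* Tiled version of: for (i = 0; i < n; i++) ... */
      for (int i = 0; i < n; i += t)                  /* over tiles */
          for (int ii = i; ii < min(i + t, n); ii++)  /* within a tile */
              printf("%d ", ii);
      printf("\n");  /* prints 0..9, the same order as the original loop */
      return 0;
  }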
15. Loop tiling
- When used with loop interchange, loop tiling creates inner loops with a smaller memory trace: great for locality.
- Loop tiling is one of the most important techniques for optimizing locality: it reduces the size of the working set and changes the memory reference pattern.

  for (i = 0; i < n; i += it)
    for (ii = i; ii < min(i+it, n); ii++)
      for (j = 0; j < n; j += jt)
        for (jj = j; jj < min(j+jt, n); jj++)
          for (k = 0; k < n; k += kt)
            for (kk = k; kk < min(k+kt, n); kk++)
              ...

  for (i = 0; i < n; i += it)
    for (j = 0; j < n; j += jt)
      for (k = 0; k < n; k += kt)
        for (ii = i; ii < min(i+it, n); ii++)
          for (jj = j; jj < min(j+jt, n); jj++)
            for (kk = k; kk < min(k+kt, n); kk++)
              ...

The inner loops have a much smaller memory footprint.
16. Loop unrolling

  for (i = 0; i < 100; i++) a(i) = 1.0;

  for (i = 0; i < 100; i += 4) {
    a(i) = 1.0;
    a(i+1) = 1.0;
    a(i+2) = 1.0;
    a(i+3) = 1.0;
  }

- Reduces control overheads.
- Increases the opportunity for instruction scheduling.
- A larger body may require more resources (registers).
- This can be very effective!
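A minimal C sketch of the unrolling above together with the cleanup loop needed when the trip count is not a multiple of the unroll factor (N = 103 is an arbitrary choice for this example):

  #include <stdio.h>
  #define N 103

  int main(void) {
      double a[N];
      int i;
      /* Main loop unrolled by 4; stops at the last full group of 4. */
      for (i = 0; i + 3 < N; i += 4) {
          a[i]     = 1.0;
          a[i + 1] = 1.0;
          a[i + 2] = 1.0;
          a[i + 3] = 1.0;
      }
      /* Cleanup loop for the leftover iterations. */
      for (; i < N; i++)
          a[i] = 1.0;
      printf("a[%d] = %f\n", N - 1, a[N - 1]);
      return 0;
  }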
17. Loop optimization in action
- Optimizing matrix multiply:

  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(i, k) * B(k, j);

- Where should we focus the optimization? The innermost loop.
- Memory references: c(i, j), A(i, 1..N), B(1..N, j).
- Spatial locality: a memory reference stride of 1 is the best.
- Temporal locality: hard to reuse cached data since the memory trace is too large.
18. Loop optimization in action
- Initial improvement: increase spatial locality in the inner loop so that the references to both A and B have stride 1.
- Transpose A before going into this operation (assuming column-major storage).

  /* Transpose A: for all i, j, A(i, j) = A(j, i) */
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(k, i) * B(k, j);
19. Loop optimization in action
- c(i, j) is repeatedly referenced in the inner loop: apply scalar replacement.

  /* Transpose A */
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++) {
      t = c(i, j);
      for (k = 1; k <= N; k++)
        t = t + A(k, i) * B(k, j);
      c(i, j) = t;
    }

Compared with the previous version:

  /* Transpose A */
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(k, i) * B(k, j);
20. Loop optimization in action
- The inner loop's memory footprint is too large.
- Loop tiling + loop interchange:

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t = c(ii, jj);
            for (kk = k; kk <= min(k+kt-1, N); kk++)
              t = t + A(kk, ii) * B(kk, jj);
            c(ii, jj) = t;
          }

Best when the inner loops fit in the L1 cache.
21. Loop optimization in action
- Loop unrolling:

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t = c(ii, jj);
            t = t + A(kk, ii) * B(kk, jj);
            t = t + A(kk+1, ii) * B(kk+1, jj);
            ...
            t = t + A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = t;
          }

This assumes the loop can be unrolled evenly; you need to take care of the boundary conditions. The code is not complete!
22. Loop optimization in action
- Instruction scheduling:
  - Each update t = t + A(...)*B(...) would have to wait on the result of the previous update in a typical processor.
  - The floating-point unit is often deeply pipelined: feed the pipeline with many independent operations.

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t0 = A(kk, ii) * B(kk, jj);
            t1 = A(kk+1, ii) * B(kk+1, jj);
            ...
            t15 = A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = c(ii, jj) + t0 + t1 + ... + t15;
          }
23. Loop optimization in action
- Further locality improvement: store A, B, and C in block order.

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t0 = A(kk, ii) * B(kk, jj);
            t1 = A(kk+1, ii) * B(kk+1, jj);
            ...
            t15 = A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = c(ii, jj) + t0 + t1 + ... + t15;
          }
24. Loop optimization in action
- See the ATLAS paper for the complete story:
- R. C. Whaley, A. Petitet, and J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, 27(1-2):3-35, 2001.
25. Parallelizing compilers
- Some well-known parallelizing compilers:
  - Vienna Fortran (Vienna)
  - Paradigm (UIUC)
  - Polaris (UIUC)
  - SUIF (Stanford)
- What do they do? Given a sequential program, they identify all for loops that can be legally executed in parallel.
- Two kinds of parallel for loops:
  - forall: loops that do not require any synchronization.
  - foracross: loops that require some synchronization.
26. Parallelizing compiler

  for (i = 0; i < 5; i++) a(i+1) = a(i) + 1;
  for (i = 0; i < 5; i++) a(i) = a(i+6) + 1;
  for (i = 0; i < 5; i++) a(2*i) = a(2*i+1) + 1;
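A brief analysis of these three loops: the first has a loop-carried flow dependence (iteration i writes a(i+1), which iteration i+1 reads), so it is not parallelizable. In the second, the writes touch a(0..4) while the reads touch a(6..10); the ranges do not overlap, so the iterations are independent. In the third, the writes touch only even-indexed elements and the reads only odd-indexed elements, so it is parallelizable as well.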
27. Deciding whether parallel execution will cause problems
- Summarize the execution order of the iterations in a sequential run.
- Summarize the order of data accesses in a sequential execution.
- Determine whether there are dependences between iterations.
- The loop is parallelizable when there are no dependences.
28. Types of loops that can be analyzed
- Not all loops are easy to analyze.
- Current techniques are limited to affine for loops:
  - Loop bounds are integer linear functions of constants and the indices of outer loops.
  - Array indexes are also linear functions.
- Many techniques also assume indexes in different dimensions are independent.
- Examples of hard-to-handle references: a(i*i), a(i, i).
29. Iteration space
- An n-deep loop nest maps to an n-dimensional discrete Cartesian space.
- Assume all loops are normalized.
- Iterations are represented as coordinates in the iteration space: (i1, i2, i3, ..., in).
30. Lexicographic order
- Iterations are represented as coordinates in the iteration space (i1, i2, i3, ..., in).
- For affine loops, the space can be represented by a set of linear inequalities.
- The sequential execution order of the iterations is the lexicographic order.
- In each iteration, which array elements are referenced?
- Are there any dependences between iterations?
31. Array access in a loop (figure)
32. Array access in a loop (figure)
33. Distance vectors
- A loop has a dependence distance d if there exists a data dependence from iteration i to iteration j, with d = j - i.
- For a loop nest, the distance is represented by a vector (one component per loop).
- Dependence analysis usually finds the exact value, but not always.
34. Dependence testing
- This is at the heart of every parallelizing compiler. Its basic function: given two array references in a loop nest, is it possible for both references to refer to the same memory location?

  for (i1 = ...)
    for (i2 = ...)
      ...
      for (in = ...) {
        ... A(f1(i1, i2, ..., in), f2(...), ..., fk(...)) ...
        ... A(g1(...), g2(...), ..., gk(...)) ...
      }

Is it possible for the two array references to refer to the same memory location in the loop?
35. Dependence testing

  for (i1 = ...)
    for (i2 = ...)
      ...
      for (in = ...) {
        ... A(f1(i1, i2, ..., in), f2(...), ..., fk(...)) ...
        ... A(g1(...), g2(...), ..., gk(...)) ...
      }

f1, ..., fk and g1, ..., gk are linear functions. Try to solve the system of linear equations

  f1 = g1, f2 = g2, ..., fk = gk

under the constraints 1 <= i1 <= b1, 1 <= i2 <= b2, ..., 1 <= in <= bn. This can easily be formulated as an integer linear programming (ILP) problem. Fourier-Motzkin elimination / the Omega test (Omega library) is the most thorough dependence test for the linear case.
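A worked instance of this formulation, using the first loop from slide 26: the write a(i+1) and the read a(i') touch the same location when i + 1 = i', subject to 0 <= i, i' <= 4 (the loop bounds). The pair i = 0, i' = 1 is a solution, so a dependence exists and the loop cannot be run in parallel.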
36. Some simpler dependence tests
- GCD test:
  - Given two references X(a0 + a1*i1 + ... + an*in) and X(b0 + b1*i1 + ... + bn*in),
  - solve the equation a0 + a1*i1 + ... + an*in = b0 + b1*i1' + ... + bn*in', where the primed and unprimed index variables are independent unknowns (the two accesses may occur in different iterations).
  - Simplifying gives a linear Diophantine equation c1*i1 + c2*i2 + ... + cn*in = c0.
  - This equation has a solution only if gcd(c1, c2, ..., cn) evenly divides c0.
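A worked instance of the GCD test, using the third loop from slide 26: the write a(2i) and the read a(2i'+1) overlap only if 2i = 2i' + 1, i.e., 2i - 2i' = 1. Since gcd(2, 2) = 2 does not evenly divide 1, there is no integer solution, hence no dependence, and the loop is parallelizable.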
37. Some simpler dependence tests
- Banerjee bounds test: compute the lower and upper bounds of each index expression and check whether the ranges overlap.
- Some tests for non-linear cases have also been developed.