Title: Loop optimizations and parallelizing compilers
1. Loop optimizations and parallelizing compilers
2. Why loops?
- 90% of execution time is spent in 10% of the code, mostly in loops.
- Loops are relatively easy to analyze.
- Loop optimizations: different ways to transform loops while preserving their semantics.
- Objective?
  - Single-threaded systems: mostly optimizing for the memory hierarchy.
  - Multi-threaded systems: loop parallelization.
- A parallelizing compiler automatically finds the loops that can be executed in parallel.
3. Loop optimization: scalar replacement of array elements

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c(i, j) = c(i, j) + a(i, k) * b(k, j);

Registers are almost never allocated to array elements. Why? Scalar replacement allows a register to be allocated to the scalar, which reduces memory references. Also known as register pipelining.

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      ct = c(i, j);
      for (k = 0; k < N; k++)
        ct = ct + a(i, k) * b(k, j);
      c(i, j) = ct;
    }
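A minimal compilable C sketch of the scalar replacement above; the value of N and the test-data initialization are choices made for this illustration, not from the slides:

  #include <stdio.h>
  #define N 4

  int main(void) {
      double a[N][N], b[N][N], c[N][N];
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              a[i][j] = i + j;   /* arbitrary test data */
              b[i][j] = i - j;
              c[i][j] = 0.0;
          }
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              double ct = c[i][j];          /* scalar replaces c[i][j] */
              for (int k = 0; k < N; k++)
                  ct += a[i][k] * b[k][j];  /* ct can stay in a register */
              c[i][j] = ct;                 /* one store after the k loop */
          }
      printf("c[0][0] = %f\n", c[0][0]);
      return 0;
  }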
4. Loop normalization

  for (i = a; i < b; i += c)
    ...

  for (ii = 1; ii < ???; ii++) {
    i = a + (ii - 1) * c;
    ...
  }

Loop normalization does not do much by itself, but it makes the iteration space much easier to manipulate, which enables other optimizations.
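The "???" bound is left elided as in the slides. A minimal C sketch of one way to normalize, assuming the trip count is computed as (b - a + c - 1) / c (that formula is an assumption for this example):

  #include <stdio.h>

  int main(void) {
      int a = 2, b = 11, c = 3;
      /* Original loop: for (i = a; i < b; i += c) visits i = 2, 5, 8. */
      for (int ii = 1; ii <= (b - a + c - 1) / c; ii++) {
          int i = a + (ii - 1) * c;  /* recover the original index */
          printf("i = %d\n", i);     /* prints 2, 5, 8 */
      }
      return 0;
  }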
5. Loop transformations
- Change the shape of loop iterations.
- Change the access pattern:
  - Increase data reuse (locality).
  - Reduce overheads.
- Valid transformations need to maintain the dependences:
  - If (i1, i2, ..., in) depends on (j1, j2, ..., jn), then (j1, j2, ..., jn) needs to happen before (i1, i2, ..., in) in a valid transformation.
6. Loop transformations
- Unimodular transformations: loop interchange, loop permutation, loop reversal, loop skewing, and many others.
- Loop fusion and distribution.
- Loop tiling.
- Loop unrolling.
7. Unimodular transformations
- A unimodular matrix is a square matrix with all integral components and a determinant of 1 or -1.
- Let the unimodular matrix be U; it transforms iteration i = (i1, i2, ..., in) to iteration U·i.
- Applicability (proven by Michael Wolf):
  - A unimodular transformation represented by matrix U is legal when applied to a loop nest with a set of distance vectors D if and only if, for each d in D, U·d > 0 (lexicographically).
8. Unimodular transformations example: loop interchange

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i, j) = a(i-1, j) + 1;

  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      a(i, j) = a(i-1, j) + 1;

Why is this transformation valid?
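One worked answer, using the legality condition from slide 7: the statement a(i, j) = a(i-1, j) + 1 carries a dependence with distance vector d = (1, 0). Loop interchange corresponds to the unimodular matrix U = [0 1; 1 0] (determinant -1), and U·d = (0, 1), which is lexicographically positive, so the interchange is legal.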
9. Unimodular transformations example: loop permutation

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        for (l = 0; l < n; l++)
          ...
10. Unimodular transformations example: loop reversal

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i, j) = a(i-1, j) + 1.0;

  for (i = 0; i < n; i++)
    for (j = n-1; j >= 0; j--)
      a(i, j) = a(i-1, j) + 1.0;
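Why reversing the inner loop is legal here, again using the condition from slide 7: the distance vector is d = (1, 0), inner-loop reversal is U = [1 0; 0 -1], and U·d = (1, 0), which is still lexicographically positive. Reversing the outer loop instead (U = [-1 0; 0 1]) would give U·d = (-1, 0), which is not lexicographically positive, so that reversal would be illegal.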
11. Unimodular transformations example: loop skewing

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i) = a(i+j) + 1.0;

  for (i = 0; i < n; i++)
    for (j = i; j < i+n; j++)
      a(i) = a(j) + 1.0;
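In matrix form this skew is U = [1 0; 1 1], which maps iteration (i, j) to (i, i+j). Skewing is always legal: for any distance vector d = (d1, d2), U·d = (d1, d1+d2) remains lexicographically positive whenever d is.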
12. Loop fusion
- Takes two adjacent loops that have the same iteration space and combines their bodies.
- Legal when fusing does not violate any flow, anti-, or output dependence between the two loops (a counter-example is sketched after this slide's code).
- Why?
  - Increases the loop body, reducing loop overheads.
  - Increases the opportunity for instruction scheduling.
  - May improve locality.

  for (i = 0; i < n; i++) a(i) = 1.0;
  for (j = 0; j < n; j++) b(j) = 1.0;

  for (i = 0; i < n; i++) { a(i) = 1.0; b(i) = 1.0; }
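A minimal C counter-example for the legality condition above; the arrays, sizes, and the read of a[j+1] are invented for this sketch:

  #include <stdio.h>
  #define N 5

  int main(void) {
      double a[N + 1], b[N];
      for (int i = 0; i <= N; i++) a[i] = 0.0;

      /* In the separate loops, every a[i] is written before any read. */
      for (int i = 0; i < N; i++) a[i] = 1.0;
      for (int j = 0; j < N; j++) b[j] = a[j + 1];

      /* Fusing these loops would be illegal: iteration i would read
         a[i+1] before iteration i+1 writes it, violating the flow
         dependence. Here b[0] is 1.0; a fused version would get 0.0. */
      printf("b[0] = %f\n", b[0]);
      return 0;
  }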
13. Loop distribution
- Takes one loop and partitions it into two loops.
- Legal when no dependence cycle is broken.
- Why?
  - Reduces the memory trace.
  - Improves locality.
  - Increases the opportunity for instruction scheduling.

  for (i = 0; i < n; i++) { a(i) = 1.0; b(i) = a(i); }

  for (i = 0; i < n; i++) a(i) = 1.0;
  for (j = 0; j < n; j++) b(j) = a(j);
14. Loop tiling
- Replaces a single loop with two loops:
  for (i = 0; i < n; i++)  →  for (i = 0; i < n; i += t) for (ii = i; ii < min(i+t, n); ii++)
- t is called the tile size.
- An n-deep nest can be changed into an (n+1)-deep to 2n-deep nest.

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        ...

  for (i = 0; i < n; i += it)
    for (ii = i; ii < min(i+it, n); ii++)
      for (j = 0; j < n; j += jt)
        for (jj = j; jj < min(j+jt, n); jj++)
          for (k = 0; k < n; k += kt)
            for (kk = k; kk < min(k+kt, n); kk++)
              ...
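A minimal runnable C sketch of the one-dimensional tiling rewrite above; n and t are arbitrary example values:

  #include <stdio.h>

  static int min(int x, int y) { return x < y ? x : y; }

  int main(void) {
      int n = 10, t = 4;  /* t is the tile size */
      /* Tiled version of: for (i = 0; i < n; i++) ... */
      for (int i = 0; i < n; i += t)                  /* over tiles */
          for (int ii = i; ii < min(i + t, n); ii++)  /* within a tile */
              printf("%d ", ii);
      printf("\n");  /* prints 0..9, the same order as the original loop */
      return 0;
  }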
15. Loop tiling
- When used with loop interchange, loop tiling creates inner loops with a smaller memory trace: great for locality.
- Loop tiling is one of the most important techniques for optimizing locality: it reduces the size of the working set and changes the memory reference pattern.

  for (i = 0; i < n; i += it)
    for (ii = i; ii < min(i+it, n); ii++)
      for (j = 0; j < n; j += jt)
        for (jj = j; jj < min(j+jt, n); jj++)
          for (k = 0; k < n; k += kt)
            for (kk = k; kk < min(k+kt, n); kk++)
              ...

  for (i = 0; i < n; i += it)
    for (j = 0; j < n; j += jt)
      for (k = 0; k < n; k += kt)
        for (ii = i; ii < min(i+it, n); ii++)
          for (jj = j; jj < min(j+jt, n); jj++)
            for (kk = k; kk < min(k+kt, n); kk++)
              ...

The inner loops have a much smaller memory footprint.
16. Loop unrolling

  for (i = 0; i < 100; i++) a(i) = 1.0;

  for (i = 0; i < 100; i += 4) {
    a(i) = 1.0;
    a(i+1) = 1.0;
    a(i+2) = 1.0;
    a(i+3) = 1.0;
  }

- Reduces control overheads.
- Increases the opportunity for instruction scheduling.
- A larger body may require more resources (registers).
- This can be very effective!
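A minimal C sketch of the unrolling above together with the cleanup loop needed when the trip count is not a multiple of the unroll factor (N = 103 is an arbitrary choice for this example):

  #include <stdio.h>
  #define N 103

  int main(void) {
      double a[N];
      int i;
      /* Main loop unrolled by 4; stops at the last full group of 4. */
      for (i = 0; i + 3 < N; i += 4) {
          a[i]     = 1.0;
          a[i + 1] = 1.0;
          a[i + 2] = 1.0;
          a[i + 3] = 1.0;
      }
      /* Cleanup loop for the leftover iterations. */
      for (; i < N; i++)
          a[i] = 1.0;
      printf("a[%d] = %f\n", N - 1, a[N - 1]);
      return 0;
  }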
17. Loop optimization in action
- Optimizing matrix multiply:

  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(i, k) * B(k, j);

- Where should we focus the optimization? The innermost loop.
- Memory references: c(i, j), A(i, 1..N), B(1..N, j).
- Spatial locality: a memory reference stride of 1 is the best.
- Temporal locality: hard to reuse cached data since the memory trace is too large.
18. Loop optimization in action
- Initial improvement: increase spatial locality in the inner loop so that the references to both A and B have stride 1.
- Transpose A before going into this operation (assuming column-major storage).

  /* Transpose A: for all i, j, A(i, j) = A(j, i) */
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(k, i) * B(k, j);
19. Loop optimization in action
- c(i, j) is repeatedly referenced in the inner loop: apply scalar replacement.

  /* Transpose A */
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++) {
      t = c(i, j);
      for (k = 1; k <= N; k++)
        t = t + A(k, i) * B(k, j);
      c(i, j) = t;
    }

Compared with the previous version:

  /* Transpose A */
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(k, i) * B(k, j);
20. Loop optimization in action
- The inner loop's memory footprint is too large.
- Loop tiling + loop interchange:

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t = c(ii, jj);
            for (kk = k; kk <= min(k+kt-1, N); kk++)
              t = t + A(kk, ii) * B(kk, jj);
            c(ii, jj) = t;
          }

Best when the inner loops fit in the L1 cache.
21. Loop optimization in action
- Loop unrolling:

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t = c(ii, jj);
            t = t + A(kk, ii) * B(kk, jj);
            t = t + A(kk+1, ii) * B(kk+1, jj);
            ...
            t = t + A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = t;
          }

This assumes the loop can be unrolled evenly; you need to take care of the boundary conditions. The code is not complete!
22. Loop optimization in action
- Instruction scheduling:
  - Each update t = t + A(...)*B(...) would have to wait on the result of the previous update in a typical processor.
  - The floating-point unit is often deeply pipelined: feed the pipeline with many independent operations.

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t0 = A(kk, ii) * B(kk, jj);
            t1 = A(kk+1, ii) * B(kk+1, jj);
            ...
            t15 = A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = c(ii, jj) + t0 + t1 + ... + t15;
          }
23. Loop optimization in action
- Further locality improvement: store A, B, and C in block order.

  for (j = 1; j <= N; j += jt)
    for (k = 1; k <= N; k += kt)
      for (i = 1; i <= N; i += it)
        for (ii = i; ii <= min(i+it-1, N); ii++)
          for (jj = j; jj <= min(j+jt-1, N); jj++) {
            t0 = A(kk, ii) * B(kk, jj);
            t1 = A(kk+1, ii) * B(kk+1, jj);
            ...
            t15 = A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = c(ii, jj) + t0 + t1 + ... + t15;
          }
24. Loop optimization in action
- See the ATLAS paper for the complete story:
- R. C. Whaley, A. Petitet, and J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, 27(1-2):3-35, 2001.
25. Parallelizing compilers
- Some well-known parallelizing compilers:
  - Vienna Fortran (Vienna)
  - Paradigm (UIUC)
  - Polaris (UIUC)
  - SUIF (Stanford)
- What do they do? Given a sequential program, they identify all for loops that can be legally executed in parallel.
- Two kinds of parallel for loops:
  - forall: loops that do not require any synchronization.
  - foracross: loops that require some synchronization.
26. Parallelizing compiler

  for (i = 0; i < 5; i++) a(i+1) = a(i) + 1;
  for (i = 0; i < 5; i++) a(i) = a(i+6) + 1;
  for (i = 0; i < 5; i++) a(2*i) = a(2*i+1) + 1;
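A brief analysis of these three loops: the first has a loop-carried flow dependence (iteration i writes a(i+1), which iteration i+1 reads), so it is not parallelizable. In the second, the writes touch a(0..4) while the reads touch a(6..10); the ranges do not overlap, so the iterations are independent. In the third, the writes touch only even-indexed elements and the reads only odd-indexed elements, so it is parallelizable as well.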
27. Deciding whether parallel execution will cause problems
- Summarize the execution order of the iterations in a sequential run.
- Summarize the order of data accesses in a sequential execution.
- Determine whether there are dependences between iterations.
- The loop is parallelizable when there are no dependences.
28. Types of loops that can be analyzed
- Not all loops are easy to analyze.
- Current techniques are limited to affine for loops:
  - Loop bounds are integer linear functions of constants and the indices of outer loops.
  - Array indexes are also linear functions.
- Many techniques also assume indexes in different dimensions are independent.
- Examples of hard-to-handle references: a(i*i), a(i, i).
29. Iteration space
- An n-deep loop nest maps to an n-dimensional discrete Cartesian space.
- Assume all loops are normalized.
- Iterations are represented as coordinates in the iteration space: (i1, i2, i3, ..., in).
30. Lexicographic order
- Iterations are represented as coordinates in the iteration space (i1, i2, i3, ..., in).
- For affine loops, the space can be represented by a set of linear inequalities.
- The sequential execution order of the iterations is the lexicographic order.
- In each iteration, which array elements are referenced?
- Are there any dependences between iterations?
31. Array access in a loop (figure)
32. Array access in a loop (figure)
33. Distance vectors
- A loop has a dependence distance d if there exists a data dependence from iteration i to iteration j, with d = j - i.
- For a loop nest, the distance is represented by a vector (one component per loop).
- Dependence analysis usually finds the exact value, but not always.
34. Dependence testing
- This is at the heart of every parallelizing compiler. Its basic function: given two array references in a loop nest, is it possible for both references to refer to the same memory location?

  for (i1 = ...)
    for (i2 = ...)
      ...
      for (in = ...) {
        ... A(f1(i1, i2, ..., in), f2(...), ..., fk(...)) ...
        ... A(g1(...), g2(...), ..., gk(...)) ...
      }

Is it possible for the two array references to refer to the same memory location in the loop?
35. Dependence testing

  for (i1 = ...)
    for (i2 = ...)
      ...
      for (in = ...) {
        ... A(f1(i1, i2, ..., in), f2(...), ..., fk(...)) ...
        ... A(g1(...), g2(...), ..., gk(...)) ...
      }

f1, ..., fk and g1, ..., gk are linear functions. Try to solve the system of linear equations

  f1 = g1, f2 = g2, ..., fk = gk

under the constraints 1 <= i1 <= b1, 1 <= i2 <= b2, ..., 1 <= in <= bn. This can easily be formulated as an integer linear programming (ILP) problem. Fourier-Motzkin elimination / the Omega test (Omega library) is the most thorough dependence test for the linear case.
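A worked instance of this formulation, using the first loop from slide 26: the write a(i+1) and the read a(i') touch the same location when i + 1 = i', subject to 0 <= i, i' <= 4 (the loop bounds). The pair i = 0, i' = 1 is a solution, so a dependence exists and the loop cannot be run in parallel.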
36. Some simpler dependence tests
- GCD test:
  - Given two references X(a0 + a1*i1 + ... + an*in) and X(b0 + b1*i1 + ... + bn*in),
  - solve the equation a0 + a1*i1 + ... + an*in = b0 + b1*i1' + ... + bn*in', where the primed and unprimed index variables are independent unknowns (the two accesses may occur in different iterations).
  - Simplifying gives a linear Diophantine equation c1*i1 + c2*i2 + ... + cn*in = c0.
  - This equation has a solution only if gcd(c1, c2, ..., cn) evenly divides c0.
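A worked instance of the GCD test, using the third loop from slide 26: the write a(2i) and the read a(2i'+1) overlap only if 2i = 2i' + 1, i.e., 2i - 2i' = 1. Since gcd(2, 2) = 2 does not evenly divide 1, there is no integer solution, hence no dependence, and the loop is parallelizable.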
37. Some simpler dependence tests
- Banerjee bounds test: compute the lower and upper bounds of each index expression and check whether the ranges overlap.
- Some tests for non-linear cases have also been developed.