1
Loop optimizations and parallelizing compilers
2
Why loops
  • 90% of execution time is spent in 10% of the code
  • Mostly in loops
  • Relatively easy to analyze
  • Loop optimizations
  • Different ways to transform loops while keeping the
    same semantics
  • Objective?
  • Single-threaded systems: mostly optimizing for the
    memory hierarchy.
  • Multi-threaded systems: loop parallelization.
  • A parallelizing compiler automatically finds the
    loops that can be executed in parallel.

3
Loop optimization: scalar replacement of array
elements

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c(i, j) = c(i, j) + a(i, k) * b(k, j);

Registers are almost never allocated to array
elements. Why? Scalar replacement allows
registers to be allocated to the scalar, which
reduces memory references. Also known as register
pipelining.

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      ct = c(i, j);
      for (k = 0; k < N; k++)
        ct = ct + a(i, k) * b(k, j);
      c(i, j) = ct;
    }
4
Loop normalization

  for (i = a; i < b; i += c) { ... }

becomes

  for (ii = 1; ii <= ???; ii++) {
    i = a + (ii - 1) * c;
    ...
  }

Loop normalization does not do much by itself,
but it makes the iteration space much easier to
manipulate, which enables other optimizations. A
worked example follows.
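
A concrete sketch of the normalization (the values
a = 3, b = 20, c = 4 and the helper body() are mine,
not from the slide); the normalized trip count is
ceil((b - a) / c) = ceil(17 / 4) = 5:

  /* original loop: i = 3, 7, 11, 15, 19 */
  for (int i = 3; i < 20; i += 4)
    body(i);

  /* normalized: ii runs from 1 with step 1 */
  for (int ii = 1; ii <= 5; ii++) {
    int i = 3 + (ii - 1) * 4;
    body(i);
  }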
5
Loop transformations
  • Change the shape of loop iterations
  • Change the access pattern
  • Increase data reuse (locality)
  • Reduce overheads
  • Valid transformations need to maintain the
    dependences:
  • if (i1, i2, ..., in) depends on (j1, j2, ..., jn),
    then (j1, j2, ..., jn) needs to happen before
    (i1, i2, ..., in) in a valid transformation.

6
Loop transformations
  • Unimodular transformations
  • Loop interchange, loop permutation, loop
    reversal, loop skewing, and many others
  • Loop fusion and distribution
  • Loop tiling
  • Loop unrolling

7
Unimodular transformations
  • A unimodular matrix is a square matrix with all
    integral components and a determinant of 1 or -1.
  • Let the unimodular matrix be U; it transforms
    iteration I = (i1, i2, ..., in) to iteration U I.
  • Applicability (proven by Michael Wolf):
  • A unimodular transformation represented by matrix
    U is legal when applied to a loop nest with a set
    of distance vectors D if and only if for each d in
    D, U d > 0 (lexicographically positive; see the
    sketch below).

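A minimal sketch (my code, not from the slides) of the
legality check for the 2-deep case: apply U to each
distance vector d and test that U d remains
lexicographically positive.

  #include <stdio.h>

  /* Lexicographically positive: first nonzero component > 0. */
  int lex_positive(const int v[2]) {
      if (v[0] != 0) return v[0] > 0;
      return v[1] > 0;
  }

  int main(void) {
      int U[2][2] = {{0, 1}, {1, 0}};   /* loop interchange */
      int d[2] = {1, 0};                /* dependence distance (1, 0) */
      int Ud[2] = { U[0][0] * d[0] + U[0][1] * d[1],
                    U[1][0] * d[0] + U[1][1] * d[1] };
      printf("interchange on d = (1, 0): %s\n",
             lex_positive(Ud) ? "legal" : "illegal");
      return 0;
  }
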
8
Unimodular transformations example: loop
interchange

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i, j) = a(i-1, j) + 1;

  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      a(i, j) = a(i-1, j) + 1;

Why is this transformation valid? The dependence
distance vector is d = (1, 0); interchange maps it to
U d = (0, 1), which is still lexicographically positive.
9
Unimodular transformations example: loop
permutation (one possible permutation follows below)

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        for (l = 0; l < n; l++)
          ...
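
One possible permutation of this nest (my example; the
slide's target order was not transcribed) reorders
(i, j, k, l) to (k, i, l, j); any such reordering
corresponds to a unimodular permutation matrix with a
single 1 in each row and column:

  for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
      for (l = 0; l < n; l++)
        for (j = 0; j < n; j++)
          ...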
10
Unimodular transformations example: loop reversal

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i, j) = a(i-1, j) + 1.0;

  for (i = 0; i < n; i++)
    for (j = n-1; j >= 0; j--)
      a(i, j) = a(i-1, j) + 1.0;

Reversing j is legal here because the dependence
distance in j is 0.
11
Unimodular transformations example: loop skewing

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      a(i) = a(i + j) + 1.0;

  for (i = 0; i < n; i++)
    for (j = i; j < i + n; j++)
      a(i) = a(j) + 1.0;
12
Loop fusion
  • Takes two adjacent loops that have the same
    iteration space and combines their bodies.
  • Legal when fusing does not reverse any flow, anti-,
    or output dependence between the two loop bodies.
  • Why?
  • Increases the loop body, reduces loop overheads
  • Increases the chance of instruction scheduling
  • May improve locality

  for (i = 0; i < n; i++)
    a(i) = 1.0;
  for (j = 0; j < n; j++)
    b(j) = 1.0;

  for (i = 0; i < n; i++) {
    a(i) = 1.0;
    b(i) = 1.0;
  }
13
Loop distribution
  • Takes one loop and partitions it into two loops.
  • Legal when no dependence cycle is split across the
    two loops.
  • Why?
  • Reduces the memory trace
  • Improves locality
  • Increases the chance of instruction scheduling

  for (i = 0; i < n; i++) {
    a(i) = 1.0;
    b(i) = a(i);
  }

  for (i = 0; i < n; i++)
    a(i) = 1.0;
  for (j = 0; j < n; j++)
    b(j) = a(j);
14
Loop tiling
  • Replaces a single loop with two loops:
  • for (i = 0; i < n; i++)  becomes
    for (i = 0; i < n; i += t)
      for (ii = i; ii < min(i + t, n); ii++)
  • t is called the tile size.
  • An n-deep nest can be changed into an (n+1)-deep
    to 2n-deep nest.

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        ...

  for (i = 0; i < n; i += t)
    for (ii = i; ii < min(i + t, n); ii++)
      for (j = 0; j < n; j += t)
        for (jj = j; jj < min(j + t, n); jj++)
          for (k = 0; k < n; k += t)
            for (kk = k; kk < min(k + t, n); kk++)
              ...
15
Loop tiling
  • When used with loop interchange, loop tiling
    creates inner loops with a smaller memory trace:
    great for locality.
  • Loop tiling is one of the most important
    techniques for optimizing for locality.
  • Reduces the size of the working set and changes
    the memory reference pattern.

  for (i = 0; i < n; i += t)
    for (ii = i; ii < min(i + t, n); ii++)
      for (j = 0; j < n; j += t)
        for (jj = j; jj < min(j + t, n); jj++)
          for (k = 0; k < n; k += t)
            for (kk = k; kk < min(k + t, n); kk++)
              ...

  for (i = 0; i < n; i += t)
    for (j = 0; j < n; j += t)
      for (k = 0; k < n; k += t)
        for (ii = i; ii < min(i + t, n); ii++)
          for (jj = j; jj < min(j + t, n); jj++)
            for (kk = k; kk < min(k + t, n); kk++)
              ...

The inner loops have a much smaller memory footprint.
A rough tile-size estimate follows.
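
A rough sizing rule (my sketch, not from the slides):
for a tiled matrix multiply, the three t x t tiles
touched by the inner loops should fit in the L1 cache.
Assuming 8-byte doubles and a 32 KB L1:

  3 * t^2 * 8 <= 32768  =>  t^2 <= 1365  =>  t <= 36

so a tile size around 32 is a reasonable starting point.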
16
Loop unrolling

  for (i = 0; i < 100; i++)
    a(i) = 1.0;

  for (i = 0; i < 100; i += 4) {
    a(i)   = 1.0;
    a(i+1) = 1.0;
    a(i+2) = 1.0;
    a(i+3) = 1.0;
  }

  • Reduces control overheads.
  • Increases the chance for instruction scheduling.
  • A larger body may require more resources
    (registers).
  • This can be very effective!!!!

17
Loop optimization in action
  • Optimizing matrix multiply:

  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(i, k) * B(k, j);

  • Where should we focus the optimization? The
    innermost loop.
  • Memory references: c(i, j), A(i, 1..N), B(1..N, j)
  • Spatial locality: a memory reference stride of 1
    is best.
  • Temporal locality: hard to reuse cached data since
    the memory trace is too large.

18
Loop optimization in action
  • Initial improvement: increase spatial locality in
    the inner loop, so that the references to both A
    and B have a stride of 1.
  • Transpose A before going into this operation
    (assuming column-major storage).

  Transpose A   /* for all i, j: A(i, j) = A(j, i) */
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(k, i) * B(k, j);

19
Loop optimization in action
  • c(i, j) is repeatedly referenced in the inner
    loop: apply scalar replacement.

After:

  Transpose A
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++) {
      t = c(i, j);
      for (k = 1; k <= N; k++)
        t = t + A(k, i) * B(k, j);
      c(i, j) = t;
    }

Before:

  Transpose A
  for (i = 1; i <= N; i++)
    for (j = 1; j <= N; j++)
      for (k = 1; k <= N; k++)
        c(i, j) = c(i, j) + A(k, i) * B(k, j);

20
Loop optimization in action
  • The inner loop's memory footprint is too large:
  • loop tiling + loop interchange.

  for (j = 1; j <= N; j += t)
    for (k = 1; k <= N; k += t)
      for (i = 1; i <= N; i += t)
        for (ii = i; ii <= min(i+t-1, N); ii++)
          for (jj = j; jj <= min(j+t-1, N); jj++) {
            tt = c(ii, jj);
            for (kk = k; kk <= min(k+t-1, N); kk++)
              tt = tt + A(kk, ii) * B(kk, jj);
            c(ii, jj) = tt;
          }

(The scalar is renamed tt here to avoid colliding with
the tile size t.) Best when the inner loops fit in the
L1 cache.
21
Loop optimization in action
  • Loop unrolling (the kk loop, unrolled by 16):

  for (j = 1; j <= N; j += t)
    for (k = 1; k <= N; k += t)
      for (i = 1; i <= N; i += t)
        for (ii = i; ii <= min(i+t-1, N); ii++)
          for (jj = j; jj <= min(j+t-1, N); jj++) {
            tt = c(ii, jj);
            tt = tt + A(kk, ii) * B(kk, jj);
            tt = tt + A(kk+1, ii) * B(kk+1, jj);
            ...
            tt = tt + A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = tt;
          }

This assumes the loop unrolls evenly; you need to take
care of the boundary conditions. The code is not
complete!!!
22
Loop optimization in action
  • Instruction scheduling
  • Each update of tt would have to wait on the result
    of the previous one in a typical processor.
  • The floating-point unit is often deeply pipelined:
    feed the pipeline with many independent operations.

  for (j = 1; j <= N; j += t)
    for (k = 1; k <= N; k += t)
      for (i = 1; i <= N; i += t)
        for (ii = i; ii <= min(i+t-1, N); ii++)
          for (jj = j; jj <= min(j+t-1, N); jj++) {
            t0  = A(kk, ii) * B(kk, jj);
            t1  = A(kk+1, ii) * B(kk+1, jj);
            ...
            t15 = A(kk+15, ii) * B(kk+15, jj);
            c(ii, jj) = c(ii, jj)
                      + t0 + t1 + ... + t15;
          }

23
Loop optimization in action
  • Further locality improvement: block-order storage
    of A, B, and C (store each tile contiguously).

  (same code as on the previous slide)

24
Loop optimization in action
  • See the ATLAS paper for the complete story:
  • R. C. Whaley et al., "Automated Empirical
    Optimization of Software and the ATLAS Project,"
    Parallel Computing, 27(1-2):3-35, 2001.

25
Parallelizing compilers
  • Some well-known parallelizing compilers:
  • Vienna Fortran (Vienna)
  • Paradigm (UIUC)
  • Polaris (UIUC)
  • SUIF (Stanford)
  • What do they do?
  • Given a sequential program, identify all for
    loops that can be legally executed in parallel.
  • Two kinds of parallel for loops:
  • Forall loops, which do not require any
    synchronization.
  • Foracross loops, which have some synchronization.

26
Parallelizing compiler
  • For loop examples (analyzed in the sketch below):

  for (i = 0; i < 5; i++)  a(i+1) = a(i) + 1;
  for (i = 0; i < 5; i++)  a(i) = a(i+6) + 1;
  for (i = 0; i < 5; i++)  a(2*i) = a(2*i+1) + 1;
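
A hedged sketch (C with OpenMP; the analysis is mine,
not from the slide): the first loop carries a flow
dependence from iteration i to iteration i+1, so it
cannot run as a forall loop. The second writes a(0..4)
and reads a(6..10), and the third writes even elements
and reads odd elements, so both touch disjoint
locations and can run as forall loops:

  #include <omp.h>

  void examples(int a[])   /* assumes a has >= 11 elements */
  {
      /* Not parallelizable: a(i+1) depends on a(i), distance 1. */
      for (int i = 0; i < 5; i++)
          a[i + 1] = a[i] + 1;

      /* Parallelizable: writes a[0..4], reads a[6..10]; disjoint. */
      #pragma omp parallel for
      for (int i = 0; i < 5; i++)
          a[i] = a[i + 6] + 1;

      /* Parallelizable: writes even elements, reads odd elements. */
      #pragma omp parallel for
      for (int i = 0; i < 5; i++)
          a[2 * i] = a[2 * i + 1] + 1;
  }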
27
Deciding whether parallel execution will cause
problems
  • Summarize the execution order of the iterations
    in a sequential run.
  • Summarize the order of data accesses in a
    sequential execution.
  • Determine whether there are dependences between
    iterations.
  • The loop is parallelizable when there are no
    dependences.

28
Types of loops that can be analyzed
  • Not all loops are easy to analyze.
  • Current techniques are limited to affine for
    loops:
  • Loop bounds are integer linear functions of
    constants and outer loop indices.
  • Array indexes are also linear functions.
  • Many techniques also assume indexes in different
    dimensions are independent.
  • Examples of hard-to-handle references (contrasted
    below):
  • a(i*i), a(i, i).

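A short contrast (my examples, not from the slide),
assuming the hard cases above mean a nonlinear
subscript and coupled subscripts:

  void affine(int n, int a[], int b[][100])
  {
      /* Affine: bounds and subscripts are linear in i, j. */
      for (int i = 0; i < n; i++)
          for (int j = i; j < 2 * i + 3; j++)
              a[2 * i + 3 * j + 1] = 0;

      /* Hard cases for dependence analysis. */
      for (int i = 0; i < n; i++) {
          a[i * i] = 0;    /* nonlinear in i */
          b[i][i] = 0;     /* dimensions not independent */
      }
  }
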
29
Iteration space
  • An n-deep loop nest maps to an n-dimensional
    discrete Cartesian space.
  • Assume the loops are all normalized.
  • Iterations are represented as coordinates in the
    iteration space: (i1, i2, i3, ..., in).

30
Lexicographic order
  • Iterations are represented as coordinates in the
    iteration space: (i1, i2, i3, ..., in).
  • For affine loops:
  • The space can be represented by a set of linear
    inequalities.
  • The sequential execution order of iterations is
    the lexicographic order (a small worked example
    follows).
  • In each iteration, what array elements are
    referenced?
  • Are there any dependences between iterations?

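A small worked example (mine, not from the slides):
the nest

  for (i = 0; i < 3; i++)
    for (j = 0; j <= i; j++)
      ...

has the iteration space { (i, j) : 0 <= i <= 2,
0 <= j <= i }, and its sequential (lexicographic)
order is (0,0), (1,0), (1,1), (2,0), (2,1), (2,2).
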
31
Array access in a loop
32
Array access in a loop
33
Distance vectors
  • A loop has a distance d if there exists a data
    dependence from iteration i to iteration j, and
    d = j - i.
  • For a loop nest, the distance is represented by a
    vector. For example, in the loop interchange
    example above, a(i, j) depends on a(i-1, j), so
    the distance vector is (1, 0).
  • Dependence analysis usually finds the exact
    value, but not always.

34
Dependence testing
  • This is at the heart of every parallelizing
    compiler. The basic question: given two array
    references in a loop nest, is it possible for both
    to refer to the same memory location?

  for (i1 ...)
    for (i2 ...)
      ...
        for (in ...) {
          ...
          A(f1(i1, i2, ..., in), f2(...), ..., fk(...)) = ...
          ... = A(g1(...), g2(...), ..., gk(...))
        }

Is it possible for the two array references to refer
to the same memory location in the loop?
35
Dependence testing

  for (i1 ...)
    for (i2 ...)
      ...
        for (in ...) {
          ...
          A(f1(i1, i2, ..., in), f2(...), ..., fk(...)) = ...
          ... = A(g1(...), g2(...), ..., gk(...))
        }

f1, ..., fk and g1, ..., gk are linear functions. Try
to solve the system of linear equations

  f1 = g1, f2 = g2, ..., fk = gk

under the constraints

  1 <= i1 <= b1, 1 <= i2 <= b2, ..., 1 <= in <= bn.

This is easily formulated as an integer linear
programming (ILP) problem. Fourier-Motzkin / the Omega
test (Omega library) is the most thorough dependence
test for the linear cases.
36
Some simpler dependence tests
  • The GCD test (a sketch follows this list):
  • X(a0 + a1*i1 + ... + an*in)
  • X(b0 + b1*i1 + ... + bn*in)
  • Solve the equation
  • a0 + a1*i1 + ... + an*in = b0 + b1*i1 + ... + bn*in
  • Simplify it to c1*i1 + c2*i2 + ... + cn*in = c0.
  • This equation only has a solution if gcd(c1, c2,
    ..., cn) evenly divides c0.

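A minimal sketch of the GCD test (my code, not from the
slides), applied to the a(2*i) vs. a(2*i+1) references
from the earlier example: 2*i = 2*j + 1 gives
2*i - 2*j = 1, and gcd(2, -2) = 2 does not divide 1,
so the references can never overlap:

  #include <stdio.h>

  static int gcd(int a, int b)
  {
      if (a < 0) a = -a;
      if (b < 0) b = -b;
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
  }

  /* Returns 1 if c1*i1 + ... + cn*in = c0 may have an
     integer solution (a dependence is possible). */
  int gcd_test(const int c[], int n, int c0)
  {
      int g = 0;
      for (int i = 0; i < n; i++)
          g = gcd(g, c[i]);
      if (g == 0)           /* all coefficients zero */
          return c0 == 0;
      return c0 % g == 0;
  }

  int main(void)
  {
      int c[] = {2, -2};    /* from 2*i - 2*j = 1 */
      printf("%s\n", gcd_test(c, 2, 1) ? "dependence possible"
                                       : "independent");
      return 0;
  }
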
37
Some simpler dependence tests
  • The Banerjee bounds test: compute the lower bound
    and upper bound of each subscript expression and
    check whether the ranges overlap.
  • Some tests for non-linear cases have also been
    developed.