Allen and Kennedy, Chapter 13

Optimizing Compilers for Modern Architectures. Allen and Kennedy, Chapter 13.

1
Compiling Array Assignments
  • Allen and Kennedy, Chapter 13

2
Fortran 90
  • Fortran 90 is the successor to Fortran 77
  • Slow to gain acceptance
  • Needs better/smarter compiler techniques to
    achieve the same level of performance as Fortran 77
    compilers
  • This chapter focuses on a single new feature:
    the array assignment statement, e.g. A(1:100) = 2.0
  • Intended to provide a direct mechanism to specify
    parallel/vector execution
  • This statement must be implemented for the
    specific available hardware. On a uniprocessor,
    the statement must be converted to a scalar loop:
    scalarization

3
Fortran 90
  • Range of a vector operation in Fortran 90 is denoted
    by a triplet <lower bound : upper bound :
    increment>
  • A(1:100:2) = B(2:51) + 3.0
  • Semantics of Fortran 90 require that for vector
    statements, all inputs to the statement are
    fetched before any results are stored

4
Outline
  • Simple scalarization
  • Safe scalarization
  • Techniques to improve on safe scalarization
  • Loop reversal
  • Input prefetching
  • Loop splitting
  • Multidimensional scalarization
  • A framework for analyzing multidimensional
    scalarization

5
Scalarization
  • Replace each array assignment by a corresponding
    DO loop
  • Is it really that easy?
  • Two key issues:
  • Wish to avoid generating large array
    temporaries
  • Wish to optimize loops to exhibit good memory
    hierarchy performance

6
Simple Scalarization
  • Consider the vector statement
      A(1:200) = 2.0 * A(1:200)
  • A scalar implementation:
      S1  DO I = 1, 200
      S2     A(I) = 2.0 * A(I)
          ENDDO
  • However, some statements cause problems:
      A(2:201) = 2.0 * A(1:200)
  • If we naively scalarize:
      DO i = 1, 200
         A(i+1) = 2.0 * A(i)
      ENDDO
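
The difference between the two readings of A(2:201) = 2.0 * A(1:200) can be checked with a small simulation. This is only an illustrative sketch in Python rather than Fortran; the function names are made up for the example, and index 0 of each list is unused padding so that subscripts match the 1-based slide text.

```python
def vector_semantics(a):
    """A(2:201) = 2.0 * A(1:200): fetch every input, then store."""
    rhs = [2.0 * a[i] for i in range(1, 201)]  # all fetches happen first
    out = list(a)
    out[2:202] = rhs                           # then all stores
    return out


def naive_scalar_loop(a):
    """DO i = 1, 200 / A(i+1) = 2.0 * A(i), updating the array in place."""
    out = list(a)
    for i in range(1, 201):
        out[i + 1] = 2.0 * out[i]              # reads a value stored one iteration ago
    return out


a = [0.0] + [1.0] * 201                        # A(1:201), all ones
print(vector_semantics(a)[2:5])                # every element doubled once
print(naive_scalar_loop(a)[2:5])               # doubling compounds along the array
```

With an array of ones, the vector reading doubles every element once, while the naive loop keeps doubling a value it has just written.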

7
Simple Scalarization
  • Meaning of statements changed by the
    scalarization process: scalarization faults
  • Naive algorithm which ignores such scalarization
    faults:
      procedure SimpleScalarize(S)
        let V0 = (L0 : U0 : D0) be the vector
          iteration specifier on left side of S
        // Generate the scalarizing loop
        let I be a new loop index variable
        generate the statement DO I = L0, U0, D0
        for each vector specifier V = (L : U : D) in S
          do
          replace V with (I + L - L0)
        generate an ENDDO statement
      end SimpleScalarize

8
Scalarization Faults
  • Why do scalarization faults occur?
  • Vector operation semantics: All values from the
    RHS of the assignment should be fetched before
    storing into the result
  • If a scalar operation stores into a location
    fetched by a later operation, we get a
    scalarization fault
  • Principle 13.1: A vector assignment generates a
    scalarization fault if and only if the scalarized
    loop carries a true dependence.
  • These dependences are known as scalarization
    dependences
  • To preserve correctness, compiler should never
    produce a scalarization dependence

9
Safe Scalarization
  • Naive algorithm for safe scalarization: Use
    temporary storage to make sure scalarization
    dependences are not created
  • Consider
      A(2:201) = 2.0 * A(1:200)
  • can be split up into
      T(1:200) = 2.0 * A(1:200)
      A(2:201) = T(1:200)
  • Then scalarize using SimpleScalarize:
      DO I = 1, 200
         T(I) = 2.0 * A(I)
      ENDDO
      DO I = 2, 201
         A(I) = T(I-1)
      ENDDO
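
As a quick check that the two-loop form preserves the vector meaning, the following Python sketch mirrors the loops above with an explicit temporary array. The function name and test data are illustrative assumptions; index 0 is unused padding to keep the subscripts 1-based.

```python
def safe_scalarized(a):
    """Two scalar loops with a full-size temporary T(1:200)."""
    out = list(a)
    t = [0.0] * 201                  # T(1:200); t[0] unused
    for i in range(1, 201):          # DO I = 1, 200
        t[i] = 2.0 * out[i]
    for i in range(2, 202):          # DO I = 2, 201
        out[i] = t[i - 1]
    return out


a = [0.0] + [float(i) for i in range(1, 202)]   # A(1:201) = 1, 2, ..., 201
result = safe_scalarized(a)
print(result[1], result[2], result[201])        # A(1) untouched, A(i) = 2 * old A(i-1)
```

The cost is visible in the sketch: 200 extra elements of storage and one extra store and load per array element.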

10
Safe Scalarization
  • Procedure SafeScalarize implements this method of
    scalarization
  • Good news
  • Scalarization always possible by using
    temporaries
  • Bad News
  • Substantial increase in memory use due to
    temporaries
  • More memory operations per array element
  • We shall look at a number of techniques to reduce
    the effects of these disadvantages

11
Loop Reversal
  • A(2:256) = A(1:255) + 1.0
  • SimpleScalarize will produce a scalarization
    fault
  • Solution: Loop reversal
      DO I = 256, 2, -1
         A(I) = A(I-1) + 1.0
      ENDDO

12
Loop Reversal
  • When can we use loop reversal?
  • Loop reversal maps dependences into
    antidependences
  • But also maps antidependences into dependences
      A(2:257) = ( A(1:256) + A(3:258) ) / 2.0
  • After scalarization:
      DO I = 2, 257
         A(I) = ( A(I-1) + A(I+1) ) / 2.0
      ENDDO
  • Loop reversal gets us:
      DO I = 257, 2, -1
         A(I) = ( A(I-1) + A(I+1) ) / 2.0
      ENDDO
  • Thus, cannot use loop reversal in the presence of
    antidependences
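
Both cases can be checked numerically. The sketch below (Python, with made-up function names and 1-based indexing via an unused element 0) compares a reversed scalar loop against fetch-before-store semantics for A(2:256) = A(1:255) + 1.0 and for A(2:257) = (A(1:256) + A(3:258)) / 2.0.

```python
def shift_add_vector(a):
    out = list(a)
    out[2:257] = [a[i - 1] + 1.0 for i in range(2, 257)]
    return out


def shift_add_reversed(a):
    out = list(a)
    for i in range(256, 1, -1):          # DO I = 256, 2, -1
        out[i] = out[i - 1] + 1.0        # A(I-1) has not been overwritten yet
    return out


def average_vector(a):
    out = list(a)
    out[2:258] = [(a[i - 1] + a[i + 1]) / 2.0 for i in range(2, 258)]
    return out


def average_reversed(a):
    out = list(a)
    for i in range(257, 1, -1):          # DO I = 257, 2, -1
        out[i] = (out[i - 1] + out[i + 1]) / 2.0   # A(I+1) was already overwritten
    return out


a = [0.0] + [float(i * i) for i in range(1, 259)]   # A(1:258)
print(shift_add_reversed(a) == shift_add_vector(a))
print(average_reversed(a) == average_vector(a))
```

Reversal repairs the first statement, which has only a true dependence, but the second statement reads on both sides of the stored element, so either loop direction overwrites an input too early.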

13
Input Prefetching
  • A(2:257) = ( A(1:256) + A(3:258) ) / 2.0
  • Causes a scalarization fault when naively
    scalarized to
      DO I = 2, 257
         A(I) = ( A(I-1) + A(I+1) ) / 2.0
      ENDDO
  • Problem: the store to A(I) in the previous
    iteration overwrites the first RHS operand,
    A(I-1), before it is fetched
  • Input prefetching: Use scalar temporaries to
    store elements of input and output arrays

14
Input Prefetching
  • A first cut at using temporaries:
      DO I = 2, 257
         T1 = A(I-1)
         T2 = ( T1 + A(I+1) ) / 2.0
         A(I) = T2
      ENDDO
  • T1 holds element of input array, T2 holds element
    of output array
  • But this faces the same problem. Can correct by
    moving assignment to T1 into previous
    iteration...

15
Input Prefetching
      T1 = A(1)
      DO I = 2, 256
         T2 = ( T1 + A(I+1) ) / 2.0
         T1 = A(I)
         A(I) = T2
      ENDDO
      T2 = ( T1 + A(258) ) / 2.0
      A(257) = T2
  • Note: We are using scalar replacement, but the
    motivation for doing so is different from that in
    Chapter 8
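
A minimal Python sketch of the prefetched loop for A(2:257) = (A(1:256) + A(3:258)) / 2.0, checked against fetch-before-store semantics. The names are illustrative; the peeled last iteration reads A(258) and stores A(257), and element 0 of each list is unused padding.

```python
def average_vector(a):
    """Reference: fetch all inputs, then store."""
    out = list(a)
    out[2:258] = [(a[i - 1] + a[i + 1]) / 2.0 for i in range(2, 258)]
    return out


def average_prefetched(a):
    out = list(a)
    t1 = out[1]                          # T1 = A(1), fetched before the loop
    for i in range(2, 257):              # DO I = 2, 256
        t2 = (t1 + out[i + 1]) / 2.0
        t1 = out[i]                      # save old A(I) before it is overwritten
        out[i] = t2
    t2 = (t1 + out[258]) / 2.0           # peeled iteration I = 257
    out[257] = t2
    return out


a = [0.0] + [float(i * i) for i in range(1, 259)]   # A(1:258)
print(average_prefetched(a) == average_vector(a))
```

Only two scalars are live across iterations, so no array temporary is needed.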

16
Input Prefetching
  • As already seen in Chapter 8, we need as many
    temporaries as the dependence threshold + 1.
  • Example:
      DO I = 2, 257
         A(I+2) = A(I) + 1.0
      ENDDO
  • Can be changed to:
      T1 = A(2)
      T2 = A(3)
      DO I = 2, 255
         T3 = T1 + 1.0
         T1 = T2
         T2 = A(I+2)
         A(I+2) = T3
      ENDDO
      T3 = T1 + 1.0
      T1 = T2
      A(258) = T3
      T3 = T1 + 1.0
      A(259) = T3
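
The threshold-2 case can be simulated the same way. In this Python sketch (illustrative names, element 0 unused), the temporaries are primed with A(2) and A(3), the first two inputs read by a loop running over I = 2, 257, and the result is compared with the vector statement A(4:259) = A(2:257) + 1.0.

```python
def shift2_vector(a):
    out = list(a)
    out[4:260] = [a[i] + 1.0 for i in range(2, 258)]
    return out


def shift2_prefetched(a):
    out = list(a)
    t1, t2 = out[2], out[3]              # the two inputs "in flight"
    for i in range(2, 256):              # DO I = 2, 255
        t3 = t1 + 1.0
        t1 = t2
        t2 = out[i + 2]                  # fetch old A(I+2) before storing into it
        out[i + 2] = t3
    t3 = t1 + 1.0                        # peeled iteration I = 256
    t1 = t2
    out[258] = t3
    t3 = t1 + 1.0                        # peeled iteration I = 257
    out[259] = t3
    return out


a = [0.0] + [float(i * i) for i in range(1, 260)]   # A(1:259)
print(shift2_prefetched(a) == shift2_vector(a))
```

Three scalars are needed here (threshold 2, plus one), and each iteration pays for a register-to-register copy.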

17
Input Prefetching
  • Can also unroll the loop and eliminate register
    to register copies
  • Principle 13.2: Any scalarization dependence with
    a threshold known at compile time can be
    corrected by input prefetching.

18
Input Prefetching
  • Sometimes, even when a scalarization dependence
    does not have a constant threshold, input
    prefetching can be used effectively
      A(1:N) = A(1:N) / A(1)
  • which can be naively scalarized as
      DO i = 1, N
         A(i) = A(i) / A(1)
      ENDDO
  • true dependence from first iteration to every
    other iteration
  • antidependence from first iteration to itself
  • Via input prefetching, we get
      tA1 = A(1)
      DO i = 1, N
         A(i) = A(i) / tA1
      ENDDO
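
A short Python sketch of this case, with illustrative function names and an unused element 0 so that subscripts stay 1-based:

```python
def divide_vector(a, n):
    """A(1:N) = A(1:N) / A(1) with all inputs fetched first."""
    first = a[1]
    out = list(a)
    out[1:n + 1] = [a[i] / first for i in range(1, n + 1)]
    return out


def divide_naive(a, n):
    out = list(a)
    for i in range(1, n + 1):
        out[i] = out[i] / out[1]         # A(1) becomes 1.0 after the first iteration
    return out


def divide_prefetched(a, n):
    out = list(a)
    t_a1 = out[1]                        # tA1 = A(1), hoisted out of the loop
    for i in range(1, n + 1):
        out[i] = out[i] / t_a1
    return out


a = [0.0, 2.0, 4.0, 8.0]
print(divide_vector(a, 3), divide_naive(a, 3), divide_prefetched(a, 3))
```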

19
Loop Splitting
  • Problem with using input prefetching with
    thresholds > 1: Temporaries must be saved for
    each iteration of the loop up to the threshold
  • Can potentially use loop splitting to solve this
    problem
      A(3:6) = ( A(1:4) + A(5:8) ) / 2.0
  • Scalarization loop:
      DO I = 3, 6
         A(I) = ( A(I-2) + A(I+2) ) / 2.0
      ENDDO
  • True dependence and antidependence with threshold
    of 2
  • Split up into two independent loops which do not
    interact with each other...

20
Loop Splitting
      DO I = 3, 6
         A(I) = ( A(I-2) + A(I+2) ) / 2.0
      ENDDO
  • Both the true dependence and the antidependence
    have a threshold of 2
  • With loop splitting becomes
      DO I = 3, 5, 2
         A(I) = ( A(I-2) + A(I+2) ) / 2.0
      ENDDO
      DO I = 4, 6, 2
         A(I) = ( A(I-2) + A(I+2) ) / 2.0
      ENDDO
  • Note that the threshold becomes 1 for each loop.
    Could have produced incorrect results if the
    threshold of the antidependence were not divisible
    by 2

21
Loop Splitting
  • Can write the splitting as a nested pair of
    loops:
      DO i1 = 3, 4
         DO i2 = i1, 6, 2
            A(i2) = ( A(i2-2) + A(i2+2) ) / 2.0
         ENDDO
      ENDDO
  • The inner loop carries a scalarization dependence
    with threshold 1 and the outer loop carries no
    dependence. Can apply input prefetching:
      DO i1 = 3, 4
         T1 = A(i1-2)
         DO i2 = i1, 6, 2
            T2 = ( T1 + A(i2+2) ) / 2.0
            T1 = A(i2)
            A(i2) = T2
         ENDDO
      ENDDO
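
The split-and-prefetch form can be checked against the vector statement A(3:6) = (A(1:4) + A(5:8)) / 2.0 with a small Python sketch (illustrative names; element 0 is unused padding):

```python
def split_vector(a):
    out = list(a)
    out[3:7] = [(a[i - 2] + a[i + 2]) / 2.0 for i in range(3, 7)]
    return out


def split_prefetched(a):
    out = list(a)
    for i1 in (3, 4):                    # DO i1 = 3, 4
        t1 = out[i1 - 2]
        for i2 in range(i1, 7, 2):       # DO i2 = i1, 6, 2
            t2 = (t1 + out[i2 + 2]) / 2.0
            t1 = out[i2]                 # old A(i2), needed two elements later
            out[i2] = t2
    return out


a = [0.0] + [float(i * i) for i in range(1, 9)]     # A(1:8)
print(split_prefetched(a) == split_vector(a))
```

Each inner loop walks elements two apart, so within it the dependence threshold is 1 and a single carried scalar suffices.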

22
Loop Splitting
  • Principle 13.3: Any scalarization loop in which
    all true dependences have the same constant
    threshold T and all antidependences have a
    threshold that is divisible by T can be
    transformed, using input prefetching and loop
    splitting, so that all scalarization dependences
    are eliminated.

23
Scalarization Algorithm
  • Revised scalarization algorithm:
      procedure FullScalarize
        for each vector statement S do begin
          compute the dependences of S on itself as
            though S had been scalarized
          if S has no scalarization dependences upon
            itself then SimpleScalarize(S)
          else if S has scalarization dependences, but
            no self antidependences
          then begin
            SimpleScalarize(S)
            reverse the scalarization loop
          end

24
Scalarization Algorithm
          else if all scalarization dependences have a
            threshold of 1 then begin
            SimpleScalarize(S)
            InputPrefetch(S)
          end
          else if all scalarization dependences for S
            have the same constant threshold T
            and all antidependences have thresholds that
            are divisible by T
          then SplitLoop(S)
          else if all antidependences for S have the
            same constant threshold T
            and all true dependences have thresholds that
            are divisible by T then begin
            reverse the loop
            SplitLoop(S)
          end
          else SafeScalarize(S, SL)
        end
      end FullScalarize
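
The decision cascade in FullScalarize can be sketched as a small classifier. This is a hedged illustration, not the book's implementation: the input encoding (a list of (kind, threshold) pairs, with None for a threshold that is not a compile-time constant) and the returned strategy labels are assumptions made for the example.

```python
def choose_strategy(deps):
    """deps: list of (kind, threshold); kind is 'true' or 'anti',
    threshold is a positive int or None when not a known constant."""
    true = [t for kind, t in deps if kind == "true"]   # scalarization dependences
    anti = [t for kind, t in deps if kind == "anti"]

    def same_constant(ts):
        return bool(ts) and ts[0] is not None and all(t == ts[0] for t in ts)

    def all_divisible(ts, d):
        return all(t is not None and t % d == 0 for t in ts)

    if not true:
        return "SimpleScalarize"
    if not anti:
        return "SimpleScalarize + reverse"
    if all(t == 1 for t in true):
        return "SimpleScalarize + InputPrefetch"
    if same_constant(true) and all_divisible(anti, true[0]):
        return "SplitLoop"
    if same_constant(anti) and all_divisible(true, anti[0]):
        return "reverse + SplitLoop"
    return "SafeScalarize"


print(choose_strategy([("true", 1)]))                  # A(2:256) = A(1:255) + 1.0
print(choose_strategy([("true", 1), ("anti", 1)]))     # three-point average
print(choose_strategy([("true", 2), ("anti", 2)]))     # A(3:6) example
```

The tests run from cheapest to most expensive remedy, so the temporary-array fallback is reached only when nothing else applies.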

25
Multidimensional Scalarization
  • Vector statements in Fortran 90 in more than 1
    dimension
      A(1:100, 1:100) = B(1:100, 1, 1:100)
  • corresponds to
      DO J = 1, 100
         A(1:100, J) = B(1:100, 1, J)
      ENDDO
  • Scalarization in multiple dimensions
      A(1:100, 1:100) = 2.0 * A(1:100, 1:100)
  • Obvious strategy: convert each vector iterator
    into a loop
      DO J = 1, 100, 1
         DO I = 1, 100
            A(I,J) = 2.0 * A(I,J)
         ENDDO
      ENDDO

26
Multidimensional Scalarization
  • What should the order of the loops be after
    scalarization?
  • Familiar question: We dealt with this issue in
    Loop Selection/Interchange in Chapter 5
  • Profitability of a particular configuration
    depends on target architecture
  • For simplicity, we shall assume shorter strides
    through memory are better
  • Thus, optimal choice for innermost loop is the
    leftmost vector iterator

27
Multidimensional Scalarization
  • Extending previous results to multiple
    dimensions
  • Each vector iterator is scalarized separately,
    starting from the leftmost vector iterator in the
    innermost loop and the rest of the iterators from
    left to right
  • Once the ordering is available:
  • 1. Test to see if the loop carries a
    scalarization dependence. If not, then proceed to
    the next loop.
  • 2. If the scalarization loop carries only true
    dependences, reverse the loop and proceed to the
    next loop.
  • 3. Apply input prefetching, with loop splitting
    where appropriate, to eliminate dependences to
    which it applies. Observe, however, that in outer
    loops, prefetching is done for a single submatrix
    (the remaining dimensions).
  • 4. Otherwise, the loop carries a scalarization
    fault that requires temporary storage. Generate a
    scalarization that utilizes temporary storage and
    terminate the scalarization test for this loop,
    since temporary storage will eliminate all
    scalarization faults.

28
Outer Loop Prefetching
      A(1:N, 1:N) =
         (A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0
  • If we try to scalarize this (keeping the column
    iterator in the innermost loop) we get a true
    scalarization dependence (<, >) involving the
    second input and an antidependence (>, <)
    involving the first input
  • Cannot use loop reversal...

29
Outer Loop Prefetching
      A(1:N, 1:N) =
         (A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0
  • We can use input prefetching on the outer loop.
    The temporaries will be arrays
      T0(1:N) = A(2:N+1, 0)
      DO j = 1, N-1
         T1(1:N) = ( A(0:N-1, j+1) + T0(1:N) ) / 2.0
         T0(1:N) = A(2:N+1, j)
         A(1:N, j) = T1(1:N)
      ENDDO
      T1(1:N) = ( A(0:N-1, N+1) + T0(1:N) ) / 2.0
      A(1:N, N) = T1(1:N)
  • Total temporary space required: 2 rows of
    original matrix
  • Better than storage required for copy of the
    result matrix
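
A Python sketch of outer-loop prefetching for this statement, checked against fetch-before-store semantics. Everything here is illustrative: the array is a list of lists indexed a[i][j] with i, j running over 0..N+1, and for brevity the final j = N step is folded into the loop (at the cost of one unused fetch) instead of being peeled.

```python
def diag_average_vector(a, n):
    """Reference: all inputs fetched from the original array."""
    out = [row[:] for row in a]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            out[i][j] = (a[i - 1][j + 1] + a[i + 1][j - 1]) / 2.0
    return out


def diag_average_prefetched(a, n):
    out = [row[:] for row in a]
    t0 = [out[i + 1][0] for i in range(1, n + 1)]        # T0(1:N) = A(2:N+1, 0)
    for j in range(1, n + 1):
        # T1 is computed completely before any store into column j.
        t1 = [(out[i - 1][j + 1] + t0[i - 1]) / 2.0 for i in range(1, n + 1)]
        t0 = [out[i + 1][j] for i in range(1, n + 1)]    # save old column j
        for i in range(1, n + 1):
            out[i][j] = t1[i - 1]
    return out


n = 4
a = [[float(10 * i + j * j) for j in range(n + 2)] for i in range(n + 2)]
print(diag_average_prefetched(a, n) == diag_average_vector(a, n))
```

The two temporaries together hold 2N elements, compared with N² for a full copy of the result.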

30
Loop Interchange
  • Sometimes, there is a tradeoff between
    scalarization and optimal memory hierarchy usage
      A(2:100, 3:101) = A(3:101, 1:201:2)
  • If we scalarize this using the prescribed order:
      DO I = 3, 101
         DO J = 2, 100
            A(J,I) = A(J+1,2*I-5)
         ENDDO
      ENDDO
  • Dependences (<, >) (I = 3, 4) and (>, >) (I = 6,
    7)
  • Cannot use loop reversal, input prefetching
  • Can use temporaries
31
Loop Interchange
  • However, we can use loop interchange to get
      DO J = 2, 100
         DO I = 3, 101
            A(J,I) = A(J+1,2*I-5)
         ENDDO
      ENDDO
  • Not optimal memory hierarchy usage, but reduction
    of temporary storage
  • Loop interchange is useful to reduce size of
    temporaries
  • It can also eliminate scalarization dependences

32
General Multidimensional Scalarization
  • Goal: To scalarize a single statement which has m
    vector dimensions
  • Given an ideal order of scalarization (l1, l2,
    ..., lm)
  • Let (d1, d2, ..., dn) be direction vectors for all
    true and antidependences of the statement upon
    itself
  • The scalarization matrix is an n × m matrix of
    these direction vectors
  • For instance
      A(1:N, 1:N, 1:N) = A(0:N-1, 1:N, 2:N+1) +
         A(1:N, 2:N+1, 0:N-1)
      > = <
      < > =
33
General Multidimensional Scalarization
  • If we examine any column of the direction matrix,
    we can immediately see if the corresponding loop
    can be safely scalarized as the outermost loop of
    the nest
  • If all entries of the column are = or >, it can
    be safely scalarized as the outermost loop
    without loop reversal.
  • If all entries are = or <, it can be safely
    scalarized with loop reversal.
  • If it contains a mixture of < and >, it cannot be
    scalarized by simple means.

34
General Multidimensional Scalarization
  • Once a loop has been selected for scalarization,
    the dependences carried by that loop (any
    dependence whose direction vector does not
    contain an = in the position corresponding to the
    selected loop) may be eliminated from further
    consideration.
  • In our example, if we move the second column to
    the outside, we get
      > = <        = > <
      < > =   →    > < =
  • Scalarization in this way will reduce the matrix
    to
      > <
35
A Complete Scalarization Algorithm
      procedure CompleteScalarize(S, loop_list)
        let M be the scalarization direction matrix
          resulting from scalarizing S according to loop_list
        while there are more loops to be scalarized do
          begin
          let l be the first loop in loop_list that can
            be simply scalarized with or without loop
            reversal (determined by examining the columns
            of M from left to right)
          if there is no such l then begin
            let l be the first loop on loop_list
            section l by input prefetching
            if the previous step fails then
              section S using the naive temporary method
              and exit
          end

36
A Complete Scalarization Algorithm
          else // make l the outermost loop
            section l directly or with loop reversal,
              depending on the entries in the column of M
              corresponding to l (if l is the last loop, use
              hardware section length)
          remove l from loop_list
          let M' be M with the column corresponding to l
            and the rows corresponding to non-= entries
            in that column eliminated
          M = M'
        end // while
      end CompleteScalarize
  • Time complexity: O(m^2 n) where
  • m = number of loops
  • n = number of dependences
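
The column-selection part of this procedure can be sketched in Python. This is a simplified illustration under stated assumptions: the matrix is a list of rows of '<', '=', '>' strings, the function name and result labels are invented for the example, and when no column qualifies the sketch simply records that prefetching or temporaries are needed and stops, rather than modelling those steps.

```python
def order_loops(matrix, loop_list):
    """Return [(loop, method), ...] from outermost to innermost."""
    rows = [list(r) for r in matrix]
    loops = list(loop_list)
    plan = []
    while loops:
        choice = None
        for c in range(len(loops)):                 # scan columns left to right
            entries = {row[c] for row in rows}
            if entries <= {"=", ">"}:
                choice = (c, "direct")
                break
            if entries <= {"=", "<"}:
                choice = (c, "reversed")
                break
        if choice is None:
            plan.append((loops[0], "prefetch or temporaries"))
            break                                   # simplification: stop here
        c, how = choice
        plan.append((loops[c], how))
        # Keep only dependences not carried by this loop, and drop its column.
        rows = [row[:c] + row[c + 1:] for row in rows if row[c] == "="]
        del loops[c]
    return plan


m = [[">", "=", "<"],
     ["<", ">", "="]]
print(order_loops(m, ["l1", "l2", "l3"]))
```

For the three-dimensional example, the second loop is chosen as outermost, which removes the second row and lets the remaining two loops be scalarized directly.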

37
A Complete Scalarization Algorithm
  • Correctness follows from the definition of the
    scalarization matrix and the method to remove
    dependences from the matrix
  • For a given statement S and loop list,
    CompleteScalarize produces a correct scalarization
    with the following properties:
  • (1) input prefetching is applied to the innermost
    loop possible
  • (2) the order of scalarization loops is the closest
    possible to the order specified on input among
    scalarizations with property (1).

38
Scalarization Example
      DO J = 2, N-1
         A(2:N-1,J) = (A(1:N-2,J) + A(3:N,J) +
            A(2:N-1,J-1) + A(2:N-1,J+1))/4.
      ENDDO
  • Loop carried true dependence, antidependence
  • Naive compiler could generate
      DO J = 2, N-1
         DO i = 2, N-1
            T(i-1) = (A(i-1,J) + A(i+1,J) +
               A(i,J-1) + A(i,J+1))/4
         ENDDO
         DO i = 2, N-1
            A(i,J) = T(i-1)
         ENDDO
      ENDDO
  • 2 × (N-2)^2 accesses to memory due to array T

39
Scalarization Example
  • However, can use input prefetching to get
      DO J = 2, N-1
         tA0 = A(1, J)
         DO i = 2, N-2
            tA1 = (tA0 + A(i+1,J) + A(i,J-1) + A(i,J+1))/4
            tA0 = A(i, J)
            A(i,J) = tA1
         ENDDO
         tA1 = (tA0 + A(N,J) + A(N-1,J-1) + A(N-1,J+1))/4
         A(N-1,J) = tA1
      ENDDO
  • If temporaries are allocated to registers, no
    more memory accesses than original Fortran 90
    program
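
The prefetched stencil can be compared with the original column-by-column vector statement in a short Python sketch. Names and test data are illustrative; the array is a list of lists indexed a[i][J] with index 0 unused, and the carried scalar saves the old A(i, J) just before it is overwritten.

```python
def stencil_vector(a, n):
    """DO J loop with the Fortran 90 array statement inside it."""
    out = [row[:] for row in a]
    for j in range(2, n):
        rhs = [(out[i - 1][j] + out[i + 1][j] + out[i][j - 1] + out[i][j + 1]) / 4.0
               for i in range(2, n)]                # fetch the whole column first
        for i in range(2, n):
            out[i][j] = rhs[i - 2]
    return out


def stencil_prefetched(a, n):
    out = [row[:] for row in a]
    for j in range(2, n):                           # DO J = 2, N-1
        t_a0 = out[1][j]
        for i in range(2, n - 1):                   # DO i = 2, N-2
            t_a1 = (t_a0 + out[i + 1][j] + out[i][j - 1] + out[i][j + 1]) / 4.0
            t_a0 = out[i][j]                        # old A(i,J), used by iteration i+1
            out[i][j] = t_a1
        t_a1 = (t_a0 + out[n][j] + out[n - 1][j - 1] + out[n - 1][j + 1]) / 4.0
        out[n - 1][j] = t_a1                        # peeled iteration i = N-1
    return out


n = 6
a = [[float(i * i + 3 * j) for j in range(n + 1)] for i in range(n + 1)]
print(stencil_prefetched(a, n) == stencil_vector(a, n))
```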

40
Post Scalarization Issues
  • Issues due to scalarization
  • Generates many individual loops
  • These loops carry no dependences. So reuse of
    quantities in registers is not common
  • Solution: Use loop interchange, loop fusion,
    unroll-and-jam, and scalar replacement