ECE540S Optimizing Compilers
1
ECE540S Optimizing Compilers
  • http://www.eecg.toronto.edu/voss/ece540/
  • Dependence Analysis / Automatic Parallelization, March 26, 2003
  • Michael Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, 1996. (Chapters 5-7).

2
When a single processor isn't fast enough
  • Some jobs need to be done faster than any single processor can do them.
  • weather modeling, seismic processing (finding oil), pharmaceuticals, financial simulations, and lots of other science and engineering problems
  • Need to do things faster because
  • if it takes 2 days to predict tomorrow's weather, what's the use?
  • or maybe it just takes a really long time to compute and we have a lot of money to spend on the problem.

3
Common Types of Parallel Computers
[Diagram: two machine organizations. Left, shared memory: several CPUs connected through an interconnect to a single shared MEMORY (SMTs will also act like this). Right, distributed memory: several CPU + MEM pairs connected by an interconnect.]
4
Programming Models
  • Shared Memory Models
  • explicit threading (using Pthreads for example)
  • compiler directives (such as OpenMP)
  • generally considered easier to program
  • can also run on a distributed memory machine with
    extra support, but performance may not be good on
    them
  • Distributed Memory Models
  • message passing
  • Message Passing Interface (MPI)
  • Parallel Virtual Machine (PVM)
  • generally harder to program; must find parallelism at the problem level
  • can also run on shared memory machines, and performance is sometimes better than using a shared-memory model!

5
Shared Memory
  • Ideally, sequential programs may be automatically
    transformed by a compiler into parallel programs
    for shared memory machines.
  • Many efforts to do this: Parafrase, SUIF, Polaris, ...
  • Most commercial compilers have a switch to turn on automatic parallelization
  • Sun compilers for SPARC have autopar for both the C and Fortran compilers (speedups: swim 1.4, gcc 1.0 on 2 CPUs)
  • Most compilers recognize parallel directives
  • these have recently standardized on OpenMP, beginning in 1997
  • Again, as with most optimizations, the focus is on loops
  • state of the art is to detect parallel loops

6
http://polaris.cs.uiuc.edu/polaris/polaris.html
7
A Parallel / Doall Loop
  • A loop is fully parallel if no dependences flow across iterations

        DO I = 2, N              DO I = 2, N
          A(I) = A(I) + 1          A(I) = A(I-1) + 1   ← not parallel
        ENDDO                    ENDDO

  • Parallel loops are found through dependence analysis and dependence tests
  • Usually done at the source-code level; the focus is on arrays.
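The two loops above can be checked empirically: a loop is fully parallel only if its result does not depend on the order in which iterations execute. A minimal Python sketch of that check (the helper names `run`, `indep`, and `carried` are ours, not from the slides):

```python
def run(update, order):
    # Simulate a Fortran array a(1..4), initially zero (index 0 unused).
    a = [0] * 5
    for i in order:        # execute the loop body once per iteration, in `order`
        update(a, i)
    return a

def indep(a, i):           # A(I) = A(I) + 1   -- no cross-iteration dependence
    a[i] = a[i] + 1

def carried(a, i):         # A(I) = A(I-1) + 1 -- reads the previous iteration's write
    a[i] = a[i - 1] + 1

# Reordering iterations changes nothing for the first loop...
print(run(indep, [2, 3, 4]) == run(indep, [4, 3, 2]))      # True
# ...but changes the result for the second: it is not parallel.
print(run(carried, [2, 3, 4]) == run(carried, [4, 3, 2]))  # False
```

Any schedule (including a parallel one) is just another `order`, which is why order-insensitivity is the right criterion.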

8
The Four Types of Dependencies
  • flow dependence the only true dependence. A
    write followed by a read, I1 ?f I2
  • anti dependence a false dependence. A
    read followed by a write, I1 ?a I2
  • an output dependence a false dependence. A
    write followed by another write, I1 ?o I2
  • an input dependence a false dependence. A read
    followed by a read. For the most part, these can
    be ignored. (I1 ?i I2 )
  • Each type implies that I1 must precede I2
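The four cases reduce to which kind of access to the same location comes first. A small lookup table (a sketch; the names are ours) makes the classification mechanical:

```python
# Dependence type, keyed by (earlier access, later access) to the same location.
DEPENDENCE = {
    ("write", "read"):  "flow",    # the only true dependence
    ("read",  "write"): "anti",    # false: removable by renaming
    ("write", "write"): "output",  # false: removable by renaming
    ("read",  "read"):  "input",   # mostly ignorable
}

def classify(first, second):
    return DEPENDENCE[(first, second)]

print(classify("write", "read"))   # flow
```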

9
Example 1
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i)
    end do
  • There is an instance of S1 that precedes an instance of S2 in execution, and S1 produces data that S2 consumes.
  • S1 is the source of the dependence; S2 is the sink of the dependence.
  • The dependence flows between instances of statements in the same iteration (loop-independent dependence).
  • The number of iterations between source and sink (the dependence distance) is 0. The dependence direction is (=).

10
Example 2
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i-1)
    end do
  • There is an instance of S1 that precedes an instance of S2 in execution, and S1 produces data that S2 consumes.
  • S1 is the source of the dependence; S2 is the sink of the dependence.
  • The dependence flows between instances of statements in different iterations (loop-carried dependence).
  • The dependence distance is 1. The dependence direction is positive (<).

11
Example 3
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i+1)
    end do
  • There is an instance of S2 that precedes an instance of S1 in execution, and S2 consumes data that S1 produces.
  • S2 is the source of the dependence; S1 is the sink.
  • The dependence is loop-carried.
  • The distance is 1. The direction is positive (<).
  • S1 is before S2 in the loop body, so why <?

12
Example 4
    do i = 2, 4
      do j = 2, 4
S:      a(i,j) = a(i-1,j+1)
      end do
    end do

[Diagram: the 3x3 iteration space S(i,j), i,j = 2..4, with a dependence arrow from each instance S(i,j) to S(i+1,j-1).]

  • An instance of S precedes another instance of S, and S produces data that S consumes.
  • S is both source and sink.
  • The dependence is loop-carried.
  • The dependence distance is (1,-1).
13
Problem Formulation
  • Consider the following perfect nest of depth d:

        do i1 = L1, U1
          do i2 = L2, U2
            ...
              do id = Ld, Ud
                A(f1(i1,...,id), ..., fm(i1,...,id)) = ...
                ... = A(g1(i1,...,id), ..., gm(i1,...,id))
              end do
            ...
          end do
        end do
14
Problem Formulation
  • Dependence will exist if there exist two iteration vectors i = (i1,...,id) and j = (j1,...,jd) such that i precedes (or equals) j in execution order
    and L1 ≤ i1, j1 ≤ U1
    and L2 ≤ i2, j2 ≤ U2
    and ...
    and Ld ≤ id, jd ≤ Ud
  • That is, the subscripts must match:
    f1(i) = g1(j)
    and f2(i) = g2(j)
    and ...
    and fm(i) = gm(j)
15
Problem Formulation - Example
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i-1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 2 ≤ i1 ≤ i2 ≤ 4 and such that i1 = i2 - 1?
  • Answer: yes. i1 = 2, i2 = 3 and i1 = 3, i2 = 4.
  • Hence, there is dependence!
  • The dependence distance vector is i2 - i1 = 1.
  • The dependence direction vector is sign(1) = (<).
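With small constant bounds like these, the formulation can be checked exhaustively. A brute-force sketch (the function name `dependences` is ours):

```python
def dependences(lo, hi, f, g):
    """All pairs (i1, i2) with lo <= i1 <= i2 <= hi whose subscripts collide:
    the element written as f(i1) is the element read as g(i2)."""
    return [(i1, i2)
            for i1 in range(lo, hi + 1)
            for i2 in range(i1, hi + 1)
            if f(i1) == g(i2)]

# a(i) written by S1, a(i-1) read by S2: f(i1) = i1, g(i2) = i2 - 1
pairs = dependences(2, 4, lambda i: i, lambda i: i - 1)
print(pairs)                          # [(2, 3), (3, 4)]
print({i2 - i1 for i1, i2 in pairs})  # {1}: distance 1, direction <
```

Enumerating 2d-tuples like this is exact but exponential in the bounds; the dependence tests on the following slides exist to avoid it.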

16
Problem Formulation - Example
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i+1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 2 ≤ i1 ≤ i2 ≤ 4 and such that i1 = i2 + 1?
  • Answer: yes. i1 = 3, i2 = 2 and i1 = 4, i2 = 3. (But, but!)
  • Hence, there is dependence!
  • The dependence distance vector is i2 - i1 = -1.
  • The dependence direction vector is sign(-1) = (>).
  • Is this possible?

17
Problem Formulation - Example
    do i = 1, 10
S1:   a(2i) = b(i) + c(i)
S2:   d(i) = a(2i+1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 1 ≤ i1 ≤ i2 ≤ 10 and such that 2i1 = 2i2 + 1?
  • Answer: no. 2i1 is even; 2i2+1 is odd.
  • Hence, there is no dependence!

18
Problem Formulation
  • Dependence testing is equivalent to an integer linear programming (ILP) problem with 2d variables and m+d constraints!
  • An algorithm that determines whether there exist two iteration vectors i and j that satisfy these constraints is called a dependence tester.
  • The dependence distance vector is given by j - i (sink minus source iteration vector).
  • The dependence direction vector is given by sign(j - i).
  • Dependence testing is NP-complete!
  • A dependence test that reports dependence only when there is dependence is said to be exact. Otherwise, it is inexact.
  • A dependence test must be conservative: if the existence of dependence cannot be ascertained, dependence must be assumed.

19
Dependence Testers
  • Lamport's Test.
  • GCD Test.
  • Banerjee's Inequalities.
  • Generalized GCD Test.
  • Power Test.
  • I-Test.
  • Omega Test.
  • Range Test.
  • Delta Test.
  • etc.

20
Lamport's Test
  • Lamport's Test is used when there is a single index variable in the subscript expressions, and when the coefficients of the index variable in both expressions are the same.
  • The dependence problem: do there exist i1 and i2, such that Li ≤ i1 ≤ i2 ≤ Ui and such that b*i1 + c1 = b*i2 + c2? or i2 - i1 = (c1 - c2)/b?
  • There is an integer solution if and only if (c1 - c2)/b is an integer.
  • The dependence distance is d = (c1 - c2)/b if |d| ≤ Ui - Li.
  • d > 0 ⇒ true dependence. d = 0 ⇒ loop-independent dependence. d < 0 ⇒ anti dependence.
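Under the stated assumptions (the same coefficient b on both sides), the whole test is a couple of integer operations. A Python sketch (the function name `lamport_test` is ours):

```python
def lamport_test(b, c1, c2, L, U):
    """Subscripts b*i + c1 (write) and b*i + c2 (read), loop bounds L..U.
    Returns the dependence distance d = (c1 - c2) / b, or None if there is
    no dependence (non-integer distance, or distance larger than the loop range)."""
    if (c1 - c2) % b != 0:
        return None          # no integer solution
    d = (c1 - c2) // b
    if abs(d) > U - L:
        return None          # source and sink cannot both be within bounds
    return d

print(lamport_test(1, 0, -1, 2, 4))   # 1: flow dependence at distance 1
print(lamport_test(2, 0, 1, 1, 10))   # None: 2j and 2j+1 never meet
```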

21
Lamports Test - Example
    do i = 1, n
      do j = 1, n
S:      a(i,j) = a(i-1,j+1)
      end do
    end do
  • j1 = j2 + 1? b = 1, c1 = 0, c2 = 1. There is dependence. Distance (j) is -1.
  • i1 = i2 - 1? b = 1, c1 = 0, c2 = -1. There is dependence. Distance (i) is 1.

22
Lamports Test - Example
    do i = 1, n
      do j = 1, n
S:      a(i,2j) = a(i-1,2j+1)
      end do
    end do
  • 2j1 = 2j2 + 1? b = 2, c1 = 0, c2 = 1. There is no dependence.
  • i1 = i2 - 1? b = 1, c1 = 0, c2 = -1. There is dependence. Distance (i) is 1.

There is no dependence!
23
GCD Test
  • Given the following equation (the ai's and c are integers):

        a1*x1 + a2*x2 + ... + an*xn = c

    an integer solution exists if and only if gcd(a1, a2, ..., an) divides c.
  • Problems:
  • ignores loop bounds.
  • gives no information on distance or direction of dependence.
  • often gcd() is 1, which always divides c, resulting in false dependences.
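The divisibility check is one line of integer arithmetic. A sketch (the name `gcd_test` is ours):

```python
from math import gcd
from functools import reduce

def gcd_test(coeffs, c):
    """True if a1*x1 + ... + an*xn = c can have an integer solution,
    i.e. gcd(a1, ..., an) divides c. Loop bounds are ignored."""
    g = reduce(gcd, (abs(a) for a in coeffs))
    return c % g == 0

print(gcd_test([2, -2], 1))     # False: no dependence possible
print(gcd_test([1, -1], 100))   # True: dependence reported (maybe falsely!)
```

The second call shows the test's weakness from the next slides: it reports dependence even when the loop bounds rule it out.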

24
GCD Test - Example
    do i = 1, 10
S1:   a(2i) = b(i) + c(i)
S2:   d(i) = a(2i-1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 1 ≤ i1 ≤ i2 ≤ 10 and such that 2i1 = 2i2 - 1? or 2i2 - 2i1 = 1?
  • There will be an integer solution if and only if gcd(2,-2) divides 1.
  • This is not the case, and hence, there is no dependence!

25
GCD Test Example
    do i = 1, 10
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i-100)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 1 ≤ i1 ≤ i2 ≤ 10 and such that i1 = i2 - 100? or i2 - i1 = 100?
  • There will be an integer solution if and only if gcd(1,-1) divides 100.
  • This is the case, and hence, there is dependence! Or is there?

26
Dependence Testing Complications
  • Unknown loop bounds. What is the relationship between N and 10?

        do i = 1, N
    S1:   a(i) = a(i+10)
        end do

  • Triangular loops. Must impose j < i as an additional constraint.

        do i = 1, N
          do j = 1, i-1
    S:      a(i,j) = a(j,i)
          end do
        end do
27
More Complications
  • User variables. Same problem as unknown loop bounds, but they occur due to some loop transformations (e.g., normalization).

        do i = 1, 10
    S1:   a(i) = a(i+k)
        end do

        do i = L, H
    S1:   a(i) = a(i-1)
        end do
              ⇓
        do i = 1, H-L
    S1:   a(i+L) = a(i+L-1)
        end do
28
Serious Complications
  • Aliases.
  • Equivalence statements in Fortran:
        real a(10,10), b(10)
        equivalence (a, b)
    makes b the same as the first column of a.
  • Common blocks: Fortran's way of having shared/global variables.
        common /shared/ a, b, c
        subroutine foo()
        common /shared/ a, b, c
        common /shared/ x, y, z

29
Loop Parallelization
  • A dependence is said to be carried by a loop if that loop is the outermost loop whose removal eliminates the dependence. If a dependence is not carried by any loop, it is loop-independent.

        do i = 2, n-1
          do j = 2, m-1
            a(i,j) = ...
            ...    = a(i,j)
            b(i,j) = b(i,j-1)
            c(i,j) = c(i-1,j)
          end do
        end do

  • The outermost loop with a non-zero dependence distance carries it.
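Equivalently, the carrying loop is the first level (outermost first) at which the dependence distance vector is non-zero. A sketch with illustrative names:

```python
def carrier(distance):
    """Level (0 = outermost) of the loop carrying a dependence with the
    given distance vector; None if the dependence is loop-independent."""
    for level, d in enumerate(distance):
        if d != 0:
            return level
    return None

print(carrier((0, 1)))   # 1: b(i,j)/b(i,j-1) is carried by the j loop
print(carrier((1, 0)))   # 0: c(i,j)/c(i-1,j) is carried by the i loop
print(carrier((0, 0)))   # None: the a(i,j) dependence is loop-independent
```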

30
Loop Parallelization
  • The iterations of a loop may be executed in
    parallel with one another if and only if no
    dependences are carried by the loop!

31
Loop Parallelization - Example
    do i = 2, n-1
      do j = 2, m-1
        b(i,j) = b(i,j-1)
      end do
    end do
  • Iterations of loop j must be executed
    sequentially, but the iterations of loop i may be
    executed in parallel.
  • Outer loop parallelism.

32
Loop Parallelization - Example
    do i = 2, n-1
      do j = 2, m-1
        b(i,j) = b(i-1,j)
      end do
    end do
  • Iterations of loop i must be executed
    sequentially, but the iterations of loop j may be
    executed in parallel.
  • Inner loop parallelism.

33
Loop Parallelization - Example
    do i = 2, n-1
      do j = 2, m-1
        b(i,j) = b(i-1,j-1)
      end do
    end do
  • Iterations of loop i must be executed
    sequentially, but the iterations of loop j may be
    executed in parallel. Why?
  • Inner loop parallelism.

34
OpenMP API
[Figure: the OpenMP Fortran API. The C/C++ API is similar (C$OMP → #pragma omp, PARALLEL DO → parallel for).]
35
Techniques for Breaking Dependencies and for Dealing with Scalars
  • Privatization
  • remove dependencies created by use of temporary workspaces
  • Induction Variable Substitution
  • find closed-form solutions for basic induction variables
  • Reduction
  • break reduction operations into local and global reductions

36
Privatization (Scalar Expansion)
    INTEGER J(N)
C$OMP PARALLEL DO
    DO I = 1, N
      J(I) = A(I)
      ...  = J(I)
    ENDDO

    INTEGER J
    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO
  • All scalar assignments will cause loop-carried
    dependencies
  • Can create local per-iteration or (more
    practically) per thread copies
  • scalar expansion or privatization

37
Privatization (Scalar Expansion)
    INTEGER J(P)
C$OMP PARALLEL DO
    DO I = 1, N
      J(tid()) = A(I)
      ...      = J(tid())
    ENDDO

    INTEGER J
    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO

where tid() returns the thread id (1..P) and P is the total number of threads.
  • All scalar assignments will cause loop-carried
    dependencies
  • Can create local per-iteration or (more
    practically) per thread copies
  • scalar expansion or privatization

38
Privatization (Scalar Expansion)
C$OMP PARALLEL DO
C$OMP& PRIVATE(J)
    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO

    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO
  • All scalar assignments will cause loop-carried
    dependencies
  • Can create local per-iteration or (more
    practically) per thread copies
  • scalar expansion or privatization

39
Array Privatization (Array Expansion)
C$OMP PARALLEL DO
C$OMP& PRIVATE(A)
    DO I = 1, N
      DO J = 1, N
        A(J) = ...
      ENDDO
      DO J = 1, N
        ...  = A(J)
      ENDDO
    ENDDO

    DO I = 1, N
      DO J = 1, N
        A(J) = ...
      ENDDO
      DO J = 1, N
        ...  = A(J)
      ENDDO
    ENDDO
  • Arrays may be used as temporary workspaces
  • If each read of an element of an array is
    preceded in the same iteration by a write of that
    element, the array may be privatized or expanded

40
Induction Variable Substitution
C$OMP PARALLEL DO
C$OMP& PRIVATE(J0)
    DO I = 1, N
      J0 = J + I*K
      A(I) = J0
    ENDDO

    DO I = 1, N
      J = J + K      ! J is the basic induction variable (BIV)
      A(I) = J
    ENDDO
  • Basic induction variables cause loop-carried flow dependencies
  • Can be replaced by a closed-form solution: an expression derived from the loop control variable
  • The exact opposite of strength reduction in loops
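The substitution can be checked by comparing the serial recurrence against its closed form. A Python sketch (the function names are ours):

```python
def with_biv(n, j, k):
    # Serial version: J = J + K each iteration (loop-carried flow dependence).
    out = []
    for _ in range(n):
        j = j + k
        out.append(j)      # A(I) = J
    return out

def substituted(n, j, k):
    # Closed form: iteration I computes J0 = J + I*K independently (I = 1..n).
    return [j + i * k for i in range(1, n + 1)]

print(with_biv(5, 10, 3) == substituted(5, 10, 3))   # True
```

Once every iteration computes its own J0, the iterations no longer depend on one another and the loop can run in parallel.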

41
Reduction Recognition
C$OMP PARALLEL DO
    DO I = 1, P
      sum0(I) = 0
    ENDDO
C$OMP PARALLEL DO
    DO I = 1, N
      sum0(tid()) = sum0(tid()) + A(I)
    ENDDO
    DO I = 1, P
      sum = sum + sum0(I)
    ENDDO

    DO I = 1, N
      sum = sum + A(I)
    ENDDO
  • A reduction operation of the form X = X op expr can be replaced by a local reduction phase followed by a global reduction phase
  • There is a tradeoff since the global reduction is still sequential -- if N is large you win.
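The local/global split can be sketched sequentially in Python, letting iteration i play the role of thread i mod p (the helper name is ours):

```python
def two_phase_sum(a, p):
    sum0 = [0] * p                 # one partial sum per "thread"
    for i, x in enumerate(a):      # local phase: parallelizable, since each
        sum0[i % p] += x           # slot is touched by only one "thread"
    total = 0
    for s in sum0:                 # global phase: sequential, but only p adds
        total += s
    return total

print(two_phase_sum(list(range(1, 101)), 4))   # 5050
```

The sequential cost drops from N additions to roughly N/p plus p, which is the win the slide describes when N is large.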

42
Reduction Recognition
C$OMP PARALLEL DO
C$OMP& REDUCTION(+:sum)
    DO I = 1, N
      sum = sum + A(I)
    ENDDO

    DO I = 1, N
      sum = sum + A(I)
    ENDDO
  • A reduction operation of the form X = X op expr can be replaced by a local reduction phase followed by a global reduction phase
  • There is a tradeoff since the global reduction is still sequential -- if N is large you win.

43
http://polaris.cs.uiuc.edu/polaris/polaris.html