Title: ECE540S Optimizing Compilers
1. ECE540S Optimizing Compilers
- http://www.eecg.toronto.edu/voss/ece540/
- Dependence Analysis / Automatic Parallelization, March 26, 2003
- Michael Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, 1996 (Chapters 5-7).
2. When a single processor isn't fast enough
- Some jobs need to be done faster than any single processor can do them.
  - weather modeling, seismic processing (finding oil), pharmaceuticals, financial simulations, and lots of other science and engineering problems
- Need to do things faster because
  - if it takes 2 days to predict tomorrow's weather, what's the use?
  - or maybe it just takes a really long time to compute and we have a lot of money to spend on the problem.
3. Common Types of Parallel Computers
[Figure: two machine organizations. Shared memory: several CPUs connected through an interconnect to a single shared MEMORY (SMTs will also act like this). Distributed memory: each CPU has its own MEM, and the CPU/MEM pairs are connected through an interconnect.]
4. Programming Models
- Shared Memory Models
  - explicit threading (using Pthreads, for example)
  - compiler directives (such as OpenMP)
  - generally considered easier to program
  - can also run on a distributed memory machine with extra support, but performance may not be good on them
- Distributed Memory Models
  - message passing
    - Message Passing Interface (MPI)
    - Parallel Virtual Machine (PVM)
  - generally harder to program; must find parallelism at the problem level
  - can also run on shared memory machines, and performance is sometimes better than using a shared-memory model!
5. Shared Memory
- Ideally, sequential programs may be automatically transformed by a compiler into parallel programs for shared memory machines.
- Many efforts to do this: Parafrase, SUIF, Polaris, ...
- Most commercial compilers have a switch to turn on automatic parallelization
  - Sun compilers for SPARC have autopar for both C and Fortran compilers (speedups: swim 1.4, gcc 1.0 on 2 CPUs)
- Most compilers recognize parallel directives
  - have recently standardized on OpenMP, beginning in 1997
- Again, as with most optimization, the focus is on loops
  - state of the art is to detect parallel loops
6. http://polaris.cs.uiuc.edu/polaris/polaris.html
7. A Parallel / Doall Loop
- A loop is fully parallel if no dependences flow across iterations:

      DO I = 2, N                DO I = 2, N
        A(I) = A(I) + 1            A(I) = A(I-1) + 1   ! not parallel
      ENDDO                      ENDDO

- Parallel loops are found through dependence analysis and dependence tests
- Usually done at the source-code level; the focus is on arrays.
8. The Four Types of Dependences
- flow dependence: the only true dependence. A write followed by a read, I1 δf I2
- anti dependence: a false dependence. A read followed by a write, I1 δa I2
- output dependence: a false dependence. A write followed by another write, I1 δo I2
- input dependence: a false dependence. A read followed by a read. For the most part, these can be ignored. (I1 δi I2)
- Each type implies that I1 must precede I2
9. Example 1

      do i = 2, 4
  S1:   a(i) = b(i) + c(i)
  S2:   d(i) = a(i)
      end do

- There is an instance of S1 that precedes an instance of S2 in execution, and S1 produces data that S2 consumes.
- S1 is the source of the dependence; S2 is the sink of the dependence.
- The dependence flows between instances of statements in the same iteration (loop-independent dependence).
- The number of iterations between source and sink (dependence distance) is 0. The dependence direction is =.
10. Example 2

      do i = 2, 4
  S1:   a(i) = b(i) + c(i)
  S2:   d(i) = a(i-1)
      end do

- There is an instance of S1 that precedes an instance of S2 in execution, and S1 produces data that S2 consumes.
- S1 is the source of the dependence; S2 is the sink of the dependence.
- The dependence flows between instances of statements in different iterations (loop-carried dependence).
- The dependence distance is 1. The dependence direction is positive (<).
11. Example 3

      do i = 2, 4
  S1:   a(i) = b(i) + c(i)
  S2:   d(i) = a(i+1)
      end do

- There is an instance of S2 that precedes an instance of S1 in execution, and S2 consumes data that S1 produces.
- S2 is the source of the dependence; S1 is the sink.
- The dependence is loop-carried.
- The distance is 1. The direction is positive (<).
- S1 is before S2 in the loop body, so why <?
12. Example 4

      do i = 2, 4
        do j = 2, 4
  S:      a(i,j) = a(i-1,j+1)
        end do
      end do

[Figure: the 3x3 iteration space, instances S(2,2) through S(4,4), with arrows showing the dependences between iterations.]

- An instance of S precedes another instance of S, and S produces data that S consumes.
- S is both source and sink.
- The dependence is loop-carried.
- The dependence distance is (1,-1).
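The iteration spaces in these examples are small enough to test for dependence by brute force. Below is a minimal sketch (illustrative, not the lecture's code; all names are made up) that enumerates every pair of iterations and collects the flow-dependence distance vectors:

```python
from itertools import product

def find_flow_dependences(bounds, write_fn, read_fn):
    """Enumerate all iteration pairs (src, sink) with src executing
    before sink, and report the distance vectors where the element
    written in src equals the element read in sink (flow dependences
    only; anti and output dependences are not checked here)."""
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    distances = set()
    for src in product(*ranges):
        for sink in product(*ranges):
            # lexicographic order on iteration vectors = execution order
            if src < sink and write_fn(src) == read_fn(sink):
                distances.add(tuple(t - s for s, t in zip(src, sink)))
    return distances

# Example 4: S: a(i,j) = a(i-1,j+1) with 2 <= i, j <= 4
dists = find_flow_dependences(
    [(2, 4), (2, 4)],
    write_fn=lambda it: (it[0], it[1]),         # writes a(i, j)
    read_fn=lambda it: (it[0] - 1, it[1] + 1),  # reads  a(i-1, j+1)
)
print(dists)  # {(1, -1)}
```

This is exact but exponential in the iteration count, which is why real compilers use the symbolic tests discussed below.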
13. Problem Formulation
- Consider the following perfect nest of depth d:

      do i1 = L1, U1
        do i2 = L2, U2
          ...
          do id = Ld, Ud
            a(f1(i1,...,id), ..., fm(i1,...,id)) = ...
            ... = a(g1(i1,...,id), ..., gm(i1,...,id))
          end do
          ...
        end do
      end do
14. Problem Formulation
- Dependence will exist if there exist two iteration vectors k = (k1,...,kd) and j = (j1,...,jd) such that:
  - L1 <= k1 <= U1 and L1 <= j1 <= U1 and ... and Ld <= kd <= Ud and Ld <= jd <= Ud
  - and f1(k) = g1(j) and f2(k) = g2(j) and ... and fm(k) = gm(j)
15. Problem Formulation - Example

      do i = 2, 4
  S1:   a(i) = b(i) + c(i)
  S2:   d(i) = a(i-1)
      end do

- Do there exist two iteration vectors i1 and i2, such that 2 <= i1, i2 <= 4 and such that i1 = i2 - 1?
- Answer: yes. i1=2, i2=3 and i1=3, i2=4.
- Hence, there is dependence!
- The dependence distance vector is i2 - i1 = 1.
- The dependence direction vector is sign(1) = <.
16. Problem Formulation - Example

      do i = 2, 4
  S1:   a(i) = b(i) + c(i)
  S2:   d(i) = a(i+1)
      end do

- Do there exist two iteration vectors i1 and i2, such that 2 <= i1, i2 <= 4 and such that i1 = i2 + 1?
- Answer: yes. i1=3, i2=2 and i1=4, i2=3. (But, but!)
- Hence, there is dependence!
- The dependence distance vector is i2 - i1 = -1.
- The dependence direction vector is sign(-1) = >.
- Is this possible?
17. Problem Formulation - Example

      do i = 1, 10
  S1:   a(2*i) = b(i) + c(i)
  S2:   d(i) = a(2*i+1)
      end do

- Do there exist two iteration vectors i1 and i2, such that 1 <= i1, i2 <= 10 and such that 2*i1 = 2*i2 + 1?
- Answer: no. 2*i1 is even; 2*i2 + 1 is odd.
- Hence, there is no dependence!
18. Problem Formulation
- Dependence testing is equivalent to an integer linear programming (ILP) problem of 2d variables and m+d constraints!
- An algorithm that determines if there exist two iteration vectors k and j that satisfy these constraints is called a dependence tester.
- The dependence distance vector is given by j - k.
- The dependence direction vector is given by sign(j - k).
- Dependence testing is NP-complete!
- A dependence test that reports dependence only when there is dependence is said to be exact. Otherwise it is inexact.
- A dependence test must be conservative: if the existence of dependence cannot be ascertained, dependence must be assumed.
19. Dependence Testers
- Lamport's Test.
- GCD Test.
- Banerjee's Inequalities.
- Generalized GCD Test.
- Power Test.
- I-Test.
- Omega Test.
- Range Test.
- Delta Test.
- etc.
20. Lamport's Test
- Lamport's Test is used when there is a single index variable in the subscript expressions, and when the coefficients of the index variable in both expressions are the same.
- The dependence problem: do there exist i1 and i2, such that Li <= i1, i2 <= Ui and such that b*i1 + c1 = b*i2 + c2? Equivalently, is i2 - i1 = (c1 - c2)/b?
- There is an integer solution if and only if (c1 - c2)/b is an integer.
- The dependence distance is d = (c1 - c2)/b if |d| <= Ui - Li.
- d > 0 => true dependence.
  d = 0 => loop-independent dependence.
  d < 0 => anti dependence.
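The test above is simple enough to sketch directly. The following is an illustrative implementation (not the lecture's code), assuming a write subscript b*i + c1 and a read subscript b*i + c2 with a positive coefficient b:

```python
def lamport_test(b, c1, c2, lo, hi):
    """Lamport's test for a write with subscript b*i + c1 and a read
    with subscript b*i + c2 in a loop running from lo to hi.  Returns
    the dependence distance, or None when there is no dependence."""
    num = c1 - c2
    if num % b != 0:            # (c1 - c2)/b must be an integer
        return None
    d = num // b
    if abs(d) > hi - lo:        # distance must fit in the iteration space
        return None
    return d

# Slide 21: a(i,j) = a(i-1,j+1), loops 1..n (take n = 100):
print(lamport_test(1, 0, -1, 1, 100))  # i subscripts: distance 1
print(lamport_test(1, 0, 1, 1, 100))   # j subscripts: distance -1
# Slide 22: subscripts 2*j and 2*j+1: (0 - 1)/2 is not an integer
print(lamport_test(2, 0, 1, 1, 100))   # None
```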
21. Lamport's Test - Example

      do i = 1, n
        do j = 1, n
  S:      a(i,j) = a(i-1,j+1)
        end do
      end do

- j1 = j2 + 1? b = 1, c1 = 0, c2 = 1. There is dependence. Distance (j) is -1.
- i1 = i2 - 1? b = 1, c1 = 0, c2 = -1. There is dependence. Distance (i) is 1.
22. Lamport's Test - Example

      do i = 1, n
        do j = 1, n
  S:      a(i,2*j) = a(i-1,2*j+1)
        end do
      end do

- 2*j1 = 2*j2 + 1? b = 2, c1 = 0, c2 = 1. (c1 - c2)/b is not an integer: no dependence.
- i1 = i2 - 1? b = 1, c1 = 0, c2 = -1. There is dependence. Distance (i) is 1.
- Since the j subscripts can never be equal, there is no dependence!
23. GCD Test
- Given the following equation (the ai's and c are integers):

      a1*i1 + a2*i2 + ... + an*in = c

  an integer solution exists if and only if gcd(a1,a2,...,an) divides c.
- Problems:
  - ignores loop bounds.
  - gives no information on distance or direction of dependence.
  - often gcd(...) is 1, which always divides c, resulting in false dependences.
24. GCD Test - Example

      do i = 1, 10
  S1:   a(2*i) = b(i) + c(i)
  S2:   d(i) = a(2*i-1)
      end do

- Do there exist two iteration vectors i1 and i2, such that 1 <= i1, i2 <= 10 and such that 2*i1 = 2*i2 - 1? Or: 2*i2 - 2*i1 = 1?
- There will be an integer solution if and only if gcd(2,-2) divides 1.
- This is not the case, and hence, there is no dependence!
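The GCD test reduces to one gcd computation and one divisibility check. A minimal sketch (illustrative, not the lecture's code):

```python
from math import gcd
from functools import reduce

def gcd_test(coeffs, c):
    """GCD test for a1*i1 + ... + an*in = c: an integer solution
    exists iff gcd(a1,...,an) divides c.  True means a dependence
    must be assumed (the test ignores loop bounds, so it is
    conservative)."""
    g = reduce(gcd, (abs(a) for a in coeffs))
    return c % g == 0

# Slide 24: 2*i2 - 2*i1 = 1 -> gcd(2, 2) = 2 does not divide 1
print(gcd_test([2, -2], 1))    # False: no dependence
# Slide 25: i2 - i1 = 100 -> gcd(1, 1) = 1 divides 100
print(gcd_test([1, -1], 100))  # True: dependence assumed
```

The second call shows the test's weakness: it reports dependence even though the loop bounds (1..10) make a distance of 100 impossible.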
25. GCD Test - Example

      do i = 1, 10
  S1:   a(i) = b(i) + c(i)
  S2:   d(i) = a(i-100)
      end do

- Do there exist two iteration vectors i1 and i2, such that 1 <= i1, i2 <= 10 and such that i1 = i2 - 100? Or: i2 - i1 = 100?
- There will be an integer solution if and only if gcd(1,-1) divides 100.
- This is the case, and hence, there is dependence! Or is there?
26. Dependence Testing Complications
- Unknown loop bounds. What is the relationship between N and 10?

      do i = 1, N
  S1:   a(i) = a(i+10)
      end do

- Triangular loops. Must impose j < i as an additional constraint.

      do i = 1, N
        do j = 1, i-1
  S:      a(i,j) = a(j,i)
        end do
      end do
27. More Complications
- User variables. Same problem as unknown loop bounds, but occurring due to some loop transformations (e.g., normalization).

      do i = 1, 10
  S1:   a(i) = a(i+k)
      end do

- For example, normalization rewrites

      do i = L, H
  S1:   a(i) = a(i-1)
      end do

  as

      do i = 1, H-L
  S1:   a(i+L) = a(i+L-1)
      end do
28. Serious Complications
- Aliases.
  - Equivalence statements in Fortran:

        real a(10,10), b(10)
        equivalence (a, b)

    makes b the same as the first column of a.
  - Common blocks: Fortran's way of having shared/global variables.

        common /shared/ a, b, c

        subroutine foo()
        common /shared/ a, b, c
        common /shared/ x, y, z
29. Loop Parallelization
- A dependence is said to be carried by a loop if that loop is the outermost loop whose removal eliminates the dependence. If a dependence is not carried by any loop, it is loop-independent.

      do i = 2, n-1
        do j = 2, m-1
          a(i, j) = ...
          ...     = a(i, j)       ! loop-independent
          b(i, j) = b(i, j-1)     ! carried by the j loop
          c(i, j) = c(i-1, j)     ! carried by the i loop
        end do
      end do

- The outermost loop with a non-"=" direction carries the dependence.
30. Loop Parallelization
- The iterations of a loop may be executed in parallel with one another if and only if no dependences are carried by the loop!
31. Loop Parallelization - Example

      do i = 2, n-1
        do j = 2, m-1
          b(i, j) = b(i, j-1)
        end do
      end do

- Iterations of loop j must be executed sequentially, but the iterations of loop i may be executed in parallel.
- Outer loop parallelism.
32. Loop Parallelization - Example

      do i = 2, n-1
        do j = 2, m-1
          b(i, j) = b(i-1, j)
        end do
      end do

- Iterations of loop i must be executed sequentially, but the iterations of loop j may be executed in parallel.
- Inner loop parallelism.
33. Loop Parallelization - Example

      do i = 2, n-1
        do j = 2, m-1
          b(i, j) = b(i-1, j-1)
        end do
      end do

- Iterations of loop i must be executed sequentially, but the iterations of loop j may be executed in parallel. Why?
- Inner loop parallelism.
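The dependence distance here is (1,1), so the dependence is carried by the outer i loop; for a fixed i, the j iterations only read the already-completed row i-1 and are therefore independent. A small simulation (not from the slides; names are illustrative) checks this by running the j loop in two different orders:

```python
def run(n, reverse_j):
    """Run b(i,j) = b(i-1,j-1) + 1 over a hypothetical n-by-n array,
    with the inner j loop executed forward or backward."""
    b = [[i * 10 + j for j in range(n)] for i in range(n)]
    for i in range(1, n):
        js = list(range(1, n))
        if reverse_j:
            js.reverse()      # any j order must give the same answer
        for j in js:
            b[i][j] = b[i - 1][j - 1] + 1
    return b

# If no dependence is carried by the j loop, its iteration order is free:
assert run(6, False) == run(6, True)
print("j iterations are order-independent")
```

Reordering the i loop the same way would change the result, which is the operational meaning of "loop i must be executed sequentially."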
34. OpenMP API
- The Fortran API is shown here; the C/C++ API is similar (C$OMP becomes #pragma omp, and PARALLEL DO becomes parallel for).
35. Techniques for Breaking Dependences and for Dealing with Scalars
- Privatization
  - remove dependences created by the use of temporary workspaces
- Induction Variable Substitution
  - find closed solutions for basic induction variables
- Reduction
  - break reduction operations into local and global reductions
36. Privatization (Scalar Expansion)

  Before:
      INTEGER J
      DO I = 1, N
        J = A(I)
        ... = J
      ENDDO

  After (scalar expansion):
      INTEGER J(N)
C$OMP PARALLEL DO
      DO I = 1, N
        J(I) = A(I)
        ... = J(I)
      ENDDO

- All scalar assignments will cause loop-carried dependences
- Can create local per-iteration or (more practically) per-thread copies
- scalar expansion or privatization
37. Privatization (Scalar Expansion)

  Before:
      INTEGER J
      DO I = 1, N
        J = A(I)
        ... = J
      ENDDO

  After (per-thread copies), where tid() returns the thread id (1..P) and P is the total number of threads:
      INTEGER J(P)
C$OMP PARALLEL DO
      DO I = 1, N
        J(tid()) = A(I)
        ... = J(tid())
      ENDDO

- All scalar assignments will cause loop-carried dependences
- Can create local per-iteration or (more practically) per-thread copies
- scalar expansion or privatization
38. Privatization (Scalar Expansion)

  Before:
      DO I = 1, N
        J = A(I)
        ... = J
      ENDDO

  After (privatization directive):
C$OMP PARALLEL DO PRIVATE(J)
      DO I = 1, N
        J = A(I)
        ... = J
      ENDDO

- All scalar assignments will cause loop-carried dependences
- Can create local per-iteration or (more practically) per-thread copies
- scalar expansion or privatization
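The effect of scalar expansion can be simulated directly. The sketch below (illustrative, not from the slides) contrasts a loop with a shared scalar temporary against its expanded form, where each iteration owns element j[i] and the iterations become order-independent:

```python
# Shared temporary: every iteration writes then reads j, which would
# serialize the loop if iterations ran concurrently.
def with_shared_temp(a):
    out = [0] * len(a)
    for i in range(len(a)):
        j = a[i] * 2                   # scalar temporary
        out[i] = j + 1
    return out

# Scalar expansion: each iteration gets its own element j[i], so no
# shared temporary remains; any iteration order gives the same result.
def with_expanded_temp(a):
    out = [0] * len(a)
    j = [0] * len(a)                   # expanded temporary
    for i in reversed(range(len(a))):  # simulate a different order
        j[i] = a[i] * 2
        out[i] = j[i] + 1
    return out

print(with_shared_temp([1, 2, 3]))    # [3, 5, 7]
print(with_expanded_temp([1, 2, 3]))  # [3, 5, 7]
```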
39. Array Privatization (Array Expansion)

  Before:
      DO I = 1, N
        DO J = 1, N
          A(J) = ...
        ENDDO
        DO J = 1, N
          ... = A(J)
        ENDDO
      ENDDO

  After:
C$OMP PARALLEL DO PRIVATE(A)
      DO I = 1, N
        DO J = 1, N
          A(J) = ...
        ENDDO
        DO J = 1, N
          ... = A(J)
        ENDDO
      ENDDO

- Arrays may be used as temporary workspaces
- If each read of an element of an array is preceded in the same iteration by a write of that element, the array may be privatized or expanded
40. Induction Variable Substitution

  Before (J is a basic induction variable, BIV):
      DO I = 1, N
        J = J + K
        A(I) = J
      ENDDO

  After:
C$OMP PARALLEL DO PRIVATE(J0)
      DO I = 1, N
        J0 = J + I*K
        A(I) = J0
      ENDDO

- Basic induction variables cause flow dependences
- Can be replaced by a closed-form solution
  - an induction variable is derived from the loop control variable
- The complete opposite of strength reduction of loops
41. Reduction Recognition

  Before:
      DO I = 1, N
        sum = sum + A(I)
      ENDDO

  After:
C$OMP PARALLEL DO
      DO I = 1, P
        sum0(I) = 0
      ENDDO
C$OMP PARALLEL DO
      DO I = 1, N
        sum0(tid()) = sum0(tid()) + A(I)
      ENDDO
      DO I = 1, P
        sum = sum + sum0(I)
      ENDDO

- A reduction operation of the form X = X op expr can be replaced by a local reduction phase followed by a global reduction phase
- There is a tradeoff since the global reduction is still sequential -- if N is large, you win.
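The two-phase structure can be sketched with Python threads (illustrative, not from the slides; `parallel_sum` and the chunking scheme are made up for the example):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(a, p=4):
    """Two-phase reduction: p workers each sum a private slice
    (local phase); the p partial sums are then combined
    sequentially (global phase)."""
    chunk = (len(a) + p - 1) // p
    slices = [a[t * chunk:(t + 1) * chunk] for t in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        partial = list(pool.map(sum, slices))  # local reductions
    total = 0
    for s in partial:                          # sequential global reduction
        total += s
    return total

print(parallel_sum(list(range(1, 101))))  # 5050
```

The global phase does p sequential additions regardless of N, which is the tradeoff the slide mentions: for large N the local phase dominates and the transformation wins.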
42. Reduction Recognition

  Before:
      DO I = 1, N
        sum = sum + A(I)
      ENDDO

  After:
C$OMP PARALLEL DO REDUCTION(+:sum)
      DO I = 1, N
        sum = sum + A(I)
      ENDDO

- A reduction operation of the form X = X op expr can be replaced by a local reduction phase followed by a global reduction phase
- There is a tradeoff since the global reduction is still sequential -- if N is large, you win.
43. http://polaris.cs.uiuc.edu/polaris/polaris.html