ECE540S Optimizing Compilers
1
ECE540S Optimizing Compilers
  • http://www.eecg.toronto.edu/voss/ece540/
  • Dependence Analysis / Automatic Parallelization, March 26, 2003
  • Michael Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, 1996. (Chapters 5-7).

2
When a single processor isn't fast enough
  • Some jobs need to be done faster than any single processor can do them.
  • weather modeling, seismic processing (finding oil), pharmaceuticals, financial simulations, and lots of other science and engineering problems
  • Need to do things faster because
  • if it takes 2 days to predict tomorrow's weather, what's the use?
  • or maybe it just takes a really long time to compute and we have a lot of money to spend on the problem.

3
Common Types of Parallel Computers
[Diagram: two machine organizations. Left, shared memory: several CPUs connected through an interconnect to a single shared MEMORY (SMTs will also act like this). Right, distributed memory: several CPU + MEM pairs connected by an interconnect.]
4
Programming Models
  • Shared Memory Models
  • explicit threading (using Pthreads for example)
  • compiler directives (such as OpenMP)
  • generally considered easier to program
  • can also run on a distributed memory machine with
    extra support, but performance may not be good on
    them
  • Distributed Memory Models
  • message passing
  • Message Passing Interface (MPI)
  • Parallel Virtual Machine (PVM)
  • generally harder to program; must find parallelism at the problem level
  • can also run on shared memory machines, and performance is sometimes better than using a shared-memory model!

5
Shared Memory
  • Ideally, sequential programs may be automatically
    transformed by a compiler into parallel programs
    for shared memory machines.
  • Many efforts to do this: Parafrase, SUIF, Polaris, ...
  • Most commercial compilers have a switch to turn on automatic parallelization
  • Sun compilers for SPARC have autopar for both the C and Fortran compilers (speedups: swim 1.4, gcc 1.0 on 2 CPUs)
  • Most compilers recognize parallel directives
  • these have recently standardized on OpenMP, beginning in 1997
  • Again, as with most optimizations, the focus is on loops
  • state of the art is to detect parallel loops

6
http://polaris.cs.uiuc.edu/polaris/polaris.html
7
A Parallel / Doall Loop
  • A loop is fully parallel if no dependences flow across iterations

        DO I = 2, N              DO I = 2, N
          A(I) = A(I) + 1          A(I) = A(I-1) + 1   ← not parallel
        ENDDO                    ENDDO

  • Parallel loops are found through dependence analysis and dependence tests
  • Usually done at the source-code level; the focus is on arrays.
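The two loops above can be checked empirically: a loop is fully parallel only if its result does not depend on the order in which iterations execute. A minimal Python sketch of that check (the helper names `run`, `indep`, and `carried` are ours, not from the slides):

```python
def run(update, order):
    # Simulate a Fortran array a(1..4), initially zero (index 0 unused).
    a = [0] * 5
    for i in order:        # execute the loop body once per iteration, in `order`
        update(a, i)
    return a

def indep(a, i):           # A(I) = A(I) + 1   -- no cross-iteration dependence
    a[i] = a[i] + 1

def carried(a, i):         # A(I) = A(I-1) + 1 -- reads the previous iteration's write
    a[i] = a[i - 1] + 1

# Reordering iterations changes nothing for the first loop...
print(run(indep, [2, 3, 4]) == run(indep, [4, 3, 2]))      # True
# ...but changes the result for the second: it is not parallel.
print(run(carried, [2, 3, 4]) == run(carried, [4, 3, 2]))  # False
```

Any schedule (including a parallel one) is just another `order`, which is why order-insensitivity is the right criterion.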

8
The Four Types of Dependencies
  • flow dependence the only true dependence. A
    write followed by a read, I1 ?f I2
  • anti dependence a false dependence. A
    read followed by a write, I1 ?a I2
  • an output dependence a false dependence. A
    write followed by another write, I1 ?o I2
  • an input dependence a false dependence. A read
    followed by a read. For the most part, these can
    be ignored. (I1 ?i I2 )
  • Each type implies that I1 must precede I2
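The four cases reduce to which kind of access to the same location comes first. A small lookup table (a sketch; the names are ours) makes the classification mechanical:

```python
# Dependence type, keyed by (earlier access, later access) to the same location.
DEPENDENCE = {
    ("write", "read"):  "flow",    # the only true dependence
    ("read",  "write"): "anti",    # false: removable by renaming
    ("write", "write"): "output",  # false: removable by renaming
    ("read",  "read"):  "input",   # mostly ignorable
}

def classify(first, second):
    return DEPENDENCE[(first, second)]

print(classify("write", "read"))   # flow
```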

9
Example 1
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i)
    end do
  • There is an instance of S1 that precedes an instance of S2 in execution, and S1 produces data that S2 consumes.
  • S1 is the source of the dependence; S2 is the sink of the dependence.
  • The dependence flows between instances of statements in the same iteration (loop-independent dependence).
  • The number of iterations between source and sink (the dependence distance) is 0. The dependence direction is (=).

10
Example 2
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i-1)
    end do
  • There is an instance of S1 that precedes an instance of S2 in execution, and S1 produces data that S2 consumes.
  • S1 is the source of the dependence; S2 is the sink of the dependence.
  • The dependence flows between instances of statements in different iterations (loop-carried dependence).
  • The dependence distance is 1. The dependence direction is positive (<).

11
Example 3
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i+1)
    end do
  • There is an instance of S2 that precedes an instance of S1 in execution, and S2 consumes data that S1 produces.
  • S2 is the source of the dependence; S1 is the sink.
  • The dependence is loop-carried.
  • The distance is 1. The direction is positive (<).
  • S1 is before S2 in the loop body, so why <?

12
Example 4
    do i = 2, 4
      do j = 2, 4
S:      a(i,j) = a(i-1,j+1)
      end do
    end do

[Diagram: the 3x3 iteration space S(i,j), i,j = 2..4, with a dependence arrow from each instance S(i,j) to S(i+1,j-1).]

  • An instance of S precedes another instance of S, and S produces data that S consumes.
  • S is both source and sink.
  • The dependence is loop-carried.
  • The dependence distance is (1,-1).
13
Problem Formulation
  • Consider the following perfect nest of depth d:

        do i1 = L1, U1
          do i2 = L2, U2
            ...
              do id = Ld, Ud
                A(f1(i1,...,id), ..., fm(i1,...,id)) = ...
                ... = A(g1(i1,...,id), ..., gm(i1,...,id))
              end do
            ...
          end do
        end do
14
Problem Formulation
  • Dependence will exist if there exist two iteration vectors i = (i1,...,id) and j = (j1,...,jd) such that i precedes (or equals) j in execution order
    and L1 ≤ i1, j1 ≤ U1
    and L2 ≤ i2, j2 ≤ U2
    and ...
    and Ld ≤ id, jd ≤ Ud
  • That is, the subscripts must match:
    f1(i) = g1(j)
    and f2(i) = g2(j)
    and ...
    and fm(i) = gm(j)
15
Problem Formulation - Example
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i-1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 2 ≤ i1 ≤ i2 ≤ 4 and such that i1 = i2 - 1?
  • Answer: yes. i1 = 2, i2 = 3 and i1 = 3, i2 = 4.
  • Hence, there is dependence!
  • The dependence distance vector is i2 - i1 = 1.
  • The dependence direction vector is sign(1) = (<).
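With small constant bounds like these, the formulation can be checked exhaustively. A brute-force sketch (the function name `dependences` is ours):

```python
def dependences(lo, hi, f, g):
    """All pairs (i1, i2) with lo <= i1 <= i2 <= hi whose subscripts collide:
    the element written as f(i1) is the element read as g(i2)."""
    return [(i1, i2)
            for i1 in range(lo, hi + 1)
            for i2 in range(i1, hi + 1)
            if f(i1) == g(i2)]

# a(i) written by S1, a(i-1) read by S2: f(i1) = i1, g(i2) = i2 - 1
pairs = dependences(2, 4, lambda i: i, lambda i: i - 1)
print(pairs)                          # [(2, 3), (3, 4)]
print({i2 - i1 for i1, i2 in pairs})  # {1}: distance 1, direction <
```

Enumerating 2d-tuples like this is exact but exponential in the bounds; the dependence tests on the following slides exist to avoid it.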

16
Problem Formulation - Example
    do i = 2, 4
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i+1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 2 ≤ i1 ≤ i2 ≤ 4 and such that i1 = i2 + 1?
  • Answer: yes. i1 = 3, i2 = 2 and i1 = 4, i2 = 3. (But, but!)
  • Hence, there is dependence!
  • The dependence distance vector is i2 - i1 = -1.
  • The dependence direction vector is sign(-1) = (>).
  • Is this possible?

17
Problem Formulation - Example
    do i = 1, 10
S1:   a(2i) = b(i) + c(i)
S2:   d(i) = a(2i+1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 1 ≤ i1 ≤ i2 ≤ 10 and such that 2i1 = 2i2 + 1?
  • Answer: no. 2i1 is even; 2i2+1 is odd.
  • Hence, there is no dependence!

18
Problem Formulation
  • Dependence testing is equivalent to an integer linear programming (ILP) problem with 2d variables and m+d constraints!
  • An algorithm that determines whether there exist two iteration vectors i and j that satisfy these constraints is called a dependence tester.
  • The dependence distance vector is given by j - i (sink minus source iteration vector).
  • The dependence direction vector is given by sign(j - i).
  • Dependence testing is NP-complete!
  • A dependence test that reports dependence only when there is dependence is said to be exact. Otherwise, it is inexact.
  • A dependence test must be conservative: if the existence of dependence cannot be ascertained, dependence must be assumed.

19
Dependence Testers
  • Lamport's Test.
  • GCD Test.
  • Banerjee's Inequalities.
  • Generalized GCD Test.
  • Power Test.
  • I-Test.
  • Omega Test.
  • Range Test.
  • Delta Test.
  • etc.

20
Lamport's Test
  • Lamport's Test is used when there is a single index variable in the subscript expressions, and when the coefficients of the index variable in both expressions are the same.
  • The dependence problem: do there exist i1 and i2, such that Li ≤ i1 ≤ i2 ≤ Ui and such that b*i1 + c1 = b*i2 + c2? or i2 - i1 = (c1 - c2)/b?
  • There is an integer solution if and only if (c1 - c2)/b is an integer.
  • The dependence distance is d = (c1 - c2)/b if |d| ≤ Ui - Li.
  • d > 0 ⇒ true dependence. d = 0 ⇒ loop-independent dependence. d < 0 ⇒ anti dependence.
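Under the stated assumptions (the same coefficient b on both sides), the whole test is a couple of integer operations. A Python sketch (the function name `lamport_test` is ours):

```python
def lamport_test(b, c1, c2, L, U):
    """Subscripts b*i + c1 (write) and b*i + c2 (read), loop bounds L..U.
    Returns the dependence distance d = (c1 - c2) / b, or None if there is
    no dependence (non-integer distance, or distance larger than the loop range)."""
    if (c1 - c2) % b != 0:
        return None          # no integer solution
    d = (c1 - c2) // b
    if abs(d) > U - L:
        return None          # source and sink cannot both be within bounds
    return d

print(lamport_test(1, 0, -1, 2, 4))   # 1: flow dependence at distance 1
print(lamport_test(2, 0, 1, 1, 10))   # None: 2j and 2j+1 never meet
```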

21
Lamports Test - Example
    do i = 1, n
      do j = 1, n
S:      a(i,j) = a(i-1,j+1)
      end do
    end do
  • j1 = j2 + 1? b = 1, c1 = 0, c2 = 1. There is dependence. Distance (j) is -1.
  • i1 = i2 - 1? b = 1, c1 = 0, c2 = -1. There is dependence. Distance (i) is 1.

22
Lamports Test - Example
    do i = 1, n
      do j = 1, n
S:      a(i,2j) = a(i-1,2j+1)
      end do
    end do
  • 2j1 = 2j2 + 1? b = 2, c1 = 0, c2 = 1. There is no dependence.
  • i1 = i2 - 1? b = 1, c1 = 0, c2 = -1. There is dependence. Distance (i) is 1.

There is no dependence!
23
GCD Test
  • Given the following equation (the ai's and c are integers):

        a1*x1 + a2*x2 + ... + an*xn = c

    an integer solution exists if and only if gcd(a1, a2, ..., an) divides c.
  • Problems:
  • ignores loop bounds.
  • gives no information on distance or direction of dependence.
  • often gcd() is 1, which always divides c, resulting in false dependences.
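The divisibility check is one line of integer arithmetic. A sketch (the name `gcd_test` is ours):

```python
from math import gcd
from functools import reduce

def gcd_test(coeffs, c):
    """True if a1*x1 + ... + an*xn = c can have an integer solution,
    i.e. gcd(a1, ..., an) divides c. Loop bounds are ignored."""
    g = reduce(gcd, (abs(a) for a in coeffs))
    return c % g == 0

print(gcd_test([2, -2], 1))     # False: no dependence possible
print(gcd_test([1, -1], 100))   # True: dependence reported (maybe falsely!)
```

The second call shows the test's weakness from the next slides: it reports dependence even when the loop bounds rule it out.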

24
GCD Test - Example
    do i = 1, 10
S1:   a(2i) = b(i) + c(i)
S2:   d(i) = a(2i-1)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 1 ≤ i1 ≤ i2 ≤ 10 and such that 2i1 = 2i2 - 1? or 2i2 - 2i1 = 1?
  • There will be an integer solution if and only if gcd(2,-2) divides 1.
  • This is not the case, and hence, there is no dependence!

25
GCD Test Example
    do i = 1, 10
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i-100)
    end do
  • Do there exist two iteration vectors i1 and i2, such that 1 ≤ i1 ≤ i2 ≤ 10 and such that i1 = i2 - 100? or i2 - i1 = 100?
  • There will be an integer solution if and only if gcd(1,-1) divides 100.
  • This is the case, and hence, there is dependence! Or is there?

26
Dependence Testing Complications
  • Unknown loop bounds. What is the relationship between N and 10?

        do i = 1, N
    S1:   a(i) = a(i+10)
        end do

  • Triangular loops. Must impose j < i as an additional constraint.

        do i = 1, N
          do j = 1, i-1
    S:      a(i,j) = a(j,i)
          end do
        end do
27
More Complications
  • User variables. Same problem as unknown loop bounds, but they occur due to some loop transformations (e.g., normalization).

        do i = 1, 10
    S1:   a(i) = a(i+k)
        end do

        do i = L, H
    S1:   a(i) = a(i-1)
        end do
              ⇓
        do i = 1, H-L
    S1:   a(i+L) = a(i+L-1)
        end do
28
Serious Complications
  • Aliases.
  • Equivalence statements in Fortran:
        real a(10,10), b(10)
        equivalence (a, b)
    makes b the same as the first column of a.
  • Common blocks: Fortran's way of having shared/global variables.
        common /shared/ a, b, c
        subroutine foo()
        common /shared/ a, b, c
        common /shared/ x, y, z

29
Loop Parallelization
  • A dependence is said to be carried by a loop if that loop is the outermost loop whose removal eliminates the dependence. If a dependence is not carried by any loop, it is loop-independent.

        do i = 2, n-1
          do j = 2, m-1
            a(i,j) = ...
            ...    = a(i,j)
            b(i,j) = b(i,j-1)
            c(i,j) = c(i-1,j)
          end do
        end do

  • The outermost loop with a non-zero dependence distance carries it.
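Equivalently, the carrying loop is the first level (outermost first) at which the dependence distance vector is non-zero. A sketch with illustrative names:

```python
def carrier(distance):
    """Level (0 = outermost) of the loop carrying a dependence with the
    given distance vector; None if the dependence is loop-independent."""
    for level, d in enumerate(distance):
        if d != 0:
            return level
    return None

print(carrier((0, 1)))   # 1: b(i,j)/b(i,j-1) is carried by the j loop
print(carrier((1, 0)))   # 0: c(i,j)/c(i-1,j) is carried by the i loop
print(carrier((0, 0)))   # None: the a(i,j) dependence is loop-independent
```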

30
Loop Parallelization
  • The iterations of a loop may be executed in
    parallel with one another if and only if no
    dependences are carried by the loop!

31
Loop Parallelization - Example
    do i = 2, n-1
      do j = 2, m-1
        b(i,j) = b(i,j-1)
      end do
    end do
  • Iterations of loop j must be executed
    sequentially, but the iterations of loop i may be
    executed in parallel.
  • Outer loop parallelism.

32
Loop Parallelization - Example
    do i = 2, n-1
      do j = 2, m-1
        b(i,j) = b(i-1,j)
      end do
    end do
  • Iterations of loop i must be executed
    sequentially, but the iterations of loop j may be
    executed in parallel.
  • Inner loop parallelism.

33
Loop Parallelization - Example
    do i = 2, n-1
      do j = 2, m-1
        b(i,j) = b(i-1,j-1)
      end do
    end do
  • Iterations of loop i must be executed
    sequentially, but the iterations of loop j may be
    executed in parallel. Why?
  • Inner loop parallelism.

34
OpenMP API
[Figure: the OpenMP Fortran API. The C/C++ API is similar (C$OMP → #pragma omp, PARALLEL DO → parallel for).]
35
Techniques for Breaking Dependencies and for Dealing with Scalars
  • Privatization
  • remove dependencies created by use of temporary workspaces
  • Induction Variable Substitution
  • find closed-form solutions for basic induction variables
  • Reduction
  • break reduction operations into local and global reductions

36
Privatization (Scalar Expansion)
    INTEGER J(N)
C$OMP PARALLEL DO
    DO I = 1, N
      J(I) = A(I)
      ...  = J(I)
    ENDDO

    INTEGER J
    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO
  • All scalar assignments will cause loop-carried
    dependencies
  • Can create local per-iteration or (more
    practically) per thread copies
  • scalar expansion or privatization

37
Privatization (Scalar Expansion)
    INTEGER J(P)
C$OMP PARALLEL DO
    DO I = 1, N
      J(tid()) = A(I)
      ...      = J(tid())
    ENDDO

    INTEGER J
    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO

where tid() returns the thread id (1..P) and P is the total number of threads.
  • All scalar assignments will cause loop-carried
    dependencies
  • Can create local per-iteration or (more
    practically) per thread copies
  • scalar expansion or privatization

38
Privatization (Scalar Expansion)
C$OMP PARALLEL DO
C$OMP& PRIVATE(J)
    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO

    DO I = 1, N
      J = A(I)
      ... = J
    ENDDO
  • All scalar assignments will cause loop-carried
    dependencies
  • Can create local per-iteration or (more
    practically) per thread copies
  • scalar expansion or privatization

39
Array Privatization (Array Expansion)
C$OMP PARALLEL DO
C$OMP& PRIVATE(A)
    DO I = 1, N
      DO J = 1, N
        A(J) = ...
      ENDDO
      DO J = 1, N
        ...  = A(J)
      ENDDO
    ENDDO

    DO I = 1, N
      DO J = 1, N
        A(J) = ...
      ENDDO
      DO J = 1, N
        ...  = A(J)
      ENDDO
    ENDDO
  • Arrays may be used as temporary workspaces
  • If each read of an element of an array is
    preceded in the same iteration by a write of that
    element, the array may be privatized or expanded

40
Induction Variable Substitution
C$OMP PARALLEL DO
C$OMP& PRIVATE(J0)
    DO I = 1, N
      J0 = J + I*K
      A(I) = J0
    ENDDO

    DO I = 1, N
      J = J + K      ! J is the basic induction variable (BIV)
      A(I) = J
    ENDDO
  • Basic induction variables cause loop-carried flow dependencies
  • Can be replaced by a closed-form solution: an expression derived from the loop control variable
  • The exact opposite of strength reduction in loops
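The substitution can be checked by comparing the serial recurrence against its closed form. A Python sketch (the function names are ours):

```python
def with_biv(n, j, k):
    # Serial version: J = J + K each iteration (loop-carried flow dependence).
    out = []
    for _ in range(n):
        j = j + k
        out.append(j)      # A(I) = J
    return out

def substituted(n, j, k):
    # Closed form: iteration I computes J0 = J + I*K independently (I = 1..n).
    return [j + i * k for i in range(1, n + 1)]

print(with_biv(5, 10, 3) == substituted(5, 10, 3))   # True
```

Once every iteration computes its own J0, the iterations no longer depend on one another and the loop can run in parallel.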

41
Reduction Recognition
C$OMP PARALLEL DO
    DO I = 1, P
      sum0(I) = 0
    ENDDO
C$OMP PARALLEL DO
    DO I = 1, N
      sum0(tid()) = sum0(tid()) + A(I)
    ENDDO
    DO I = 1, P
      sum = sum + sum0(I)
    ENDDO

    DO I = 1, N
      sum = sum + A(I)
    ENDDO
  • A reduction operation of the form X = X op expr can be replaced by a local reduction phase followed by a global reduction phase
  • There is a tradeoff since the global reduction is still sequential -- if N is large you win.
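The local/global split can be sketched sequentially in Python, letting iteration i play the role of thread i mod p (the helper name is ours):

```python
def two_phase_sum(a, p):
    sum0 = [0] * p                 # one partial sum per "thread"
    for i, x in enumerate(a):      # local phase: parallelizable, since each
        sum0[i % p] += x           # slot is touched by only one "thread"
    total = 0
    for s in sum0:                 # global phase: sequential, but only p adds
        total += s
    return total

print(two_phase_sum(list(range(1, 101)), 4))   # 5050
```

The sequential cost drops from N additions to roughly N/p plus p, which is the win the slide describes when N is large.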

42
Reduction Recognition
C$OMP PARALLEL DO
C$OMP& REDUCTION(+:sum)
    DO I = 1, N
      sum = sum + A(I)
    ENDDO

    DO I = 1, N
      sum = sum + A(I)
    ENDDO
  • A reduction operation of the form X = X op expr can be replaced by a local reduction phase followed by a global reduction phase
  • There is a tradeoff since the global reduction is still sequential -- if N is large you win.

43
http://polaris.cs.uiuc.edu/polaris/polaris.html