1
Coarse-Grain Parallelism
Chapter 6 of Allen and Kennedy
2
Review
  • SMP machines have multiple processors all
    accessing a central memory.
  • The processors are independent and can run
    separate processes.
  • Starting processes and synchronizing between
    processes is expensive.

3
Synchronization
  • A basic synchronization element is the barrier
    (see the sketch below).
  • A barrier in a program forces all processes to
    reach a certain point before execution continues.
  • Bus contention can cause slowdowns.
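
To make this concrete, here is a minimal sketch of barrier behavior
using OpenMP Fortran directives; OpenMP is our assumption, since the
deck's PARALLEL DO notation is not tied to a particular system. The
second loop reads values of A that other threads may have written, so
it must not start until the implicit barrier at the end of the first
worksharing loop has been passed.

!$OMP PARALLEL PRIVATE(I)
!$OMP DO
      DO I = 1, N
         A(I) = B(I) + C(I)
      ENDDO
!$OMP END DO
!     END DO implies a barrier: every thread has finished writing A
!     before any thread starts the next loop.
!$OMP DO
      DO I = 2, N
         D(I) = A(I-1) * 2.0
      ENDDO
!$OMP END DO
!$OMP END PARALLEL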

4
Single Loops
  • The analog of scalar expansion is privatization.
  • Temporaries are given a separate private copy for
    each iteration.

      DO I = 1, N
S1       T = A(I)
S2       A(I) = B(I)
S3       B(I) = T
      ENDDO

      PARALLEL DO I = 1, N PRIVATE(t)
S1       t = A(I)
S2       A(I) = B(I)
S3       B(I) = t
      ENDDO

5
Privatization
  • Definition: A scalar variable x in a loop L is
    said to be privatizable if every path from the
    loop entry to a use of x inside the loop passes
    through a definition of x.
  • Privatizability can be stated as a data-flow
    problem (a small example follows).
  • NOTE: We can also do this by declaring a variable
    x private if its SSA form doesn't contain a phi
    function at the loop entry.
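
A small sketch of the definition (our own example, in the deck's
notation): T is defined before every use in each iteration, so it is
privatizable; S is used before it is defined, so it is not.

      DO I = 1, N
         T = A(I) + 1.0      ! T defined before any use: privatizable
         B(I) = T * T
         C(I) = S + B(I)     ! S is used here before its definition below,
         S = C(I)            ! so iteration I reads the S left over from
      ENDDO                  ! iteration I-1: S is NOT privatizable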

6
Loop Distribution
  • Loop distribution eliminates carried
    dependences by converting them into
    loop-independent dependences between the new loops.
  • Consequently, it often creates opportunities for
    outer-loop parallelism (see the sketch below).
  • We must add extra barriers to keep dependent
    loops from executing out of order, so the
    overhead may outweigh the parallel savings.
  • Attempt other transformations before resorting to
    this one.
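
A sketch of the trade-off in the deck's notation: distributing the
loop below converts the carried dependence on A into a loop-independent
dependence between the two new loops. Each loop is then parallel, at
the cost of a barrier between them.

      DO I = 2, N              ! carried dependence: S2 reads A(I-1)
S1       A(I) = B(I) + C(I)
S2       D(I) = A(I-1) * 2.0
      ENDDO

      PARALLEL DO I = 2, N     ! after distribution: each loop is parallel,
         A(I) = B(I) + C(I)    ! but a barrier must separate the two loops
      END PARALLEL DO
      PARALLEL DO I = 2, N
         D(I) = A(I-1) * 2.0
      END PARALLEL DO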

7
Alignment
  • Many carried dependences are due to array
    alignment issues.
  • If we can align all references, the dependences
    disappear and the loop can run in parallel.

      DO I = 2, N
         A(I) = B(I) + C(I)
         D(I) = A(I-1) * 2.0
      ENDDO

      DO I = 1, N
         IF (I .GT. 1) A(I) = B(I) + C(I)
         IF (I .LT. N) D(I+1) = A(I) * 2.0
      ENDDO
8
Loop Fusion
  • Loop distribution was a method for separating
    the parallel parts of a loop.
  • Our solution attempted to find the maximal loop
    distribution.
  • The maximal distribution often yields
    parallelizable components too small for efficient
    parallelization.
  • Two obvious solutions (the first is sketched below):
  • Strip mine large loops to create larger
    granularity.
  • Perform maximal distribution, and fuse together
    parallelizable loops.
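
A sketch of the first solution in the deck's notation; the strip size
of 64 is an arbitrary choice.

      PARALLEL DO II = 1, N, 64        ! each process takes a strip of 64 iterations
         DO I = II, MIN(II+63, N)
            A(I) = B(I) + C(I)
         ENDDO
      END PARALLEL DO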

9
Fusion Safety
Definition: A loop-independent dependence between
statements S1 and S2 in loops L1 and L2,
respectively, is fusion-preventing if fusing L1
and L2 causes the dependence to be carried by the
combined loop in the reverse direction.
      DO I = 1, N
S1       A(I) = B(I) + C
      ENDDO
      DO I = 1, N
S2       D(I) = A(I+1) + E
      ENDDO

      DO I = 1, N
S1       A(I) = B(I) + C
S2       D(I) = A(I+1) + E
      ENDDO
10
Fusion Safety
  • We shouldn't fuse loops if fusing would
    violate the ordering of the dependence graph.
  • Ordering Constraint: Two loops cannot be validly
    fused if there exists a path of loop-independent
    dependences between them containing a loop or
    statement not being fused with them.

Fusing L1 with L3 violates the ordering
constraint: the fused loop would have to occur both
before and after the node L2.
11
Fusion Profitability
Parallel loops should generally not be merged
with sequential loops. Definition: An edge
between two statements in loops L1 and L2,
respectively, is said to be parallelism-inhibiting
if, after merging L1 and L2, the dependence is
carried by the combined loop.
      DO I = 1, N
S1       A(I+1) = B(I) + C
      ENDDO
      DO I = 1, N
S2       D(I) = A(I) + E
      ENDDO

      DO I = 1, N
S1       A(I+1) = B(I) + C
S2       D(I) = A(I) + E
      ENDDO
12
Loop Interchange
  • Loop interchange moves dependence-free loops to
    the outermost level.
  • Theorem:
  • In a perfect loop nest, a particular loop can
    be parallelized at the outermost level if and
    only if its column of the direction matrix
    contains only "=" entries.
  • Vectorization, in contrast, moves loops to the
    innermost level.

13
Loop Interchange
  • DO I = 1, N
  •   DO J = 1, N
  •     A(I+1, J) = A(I, J) + B(I, J)
  •   ENDDO
  • ENDDO
  • OK for vectorization
  • Problematic for parallelization

14
Loop Interchange
  • PARALLEL DO J = 1, N
  •   DO I = 1, N
  •     A(I+1, J) = A(I, J) + B(I, J)
  •   ENDDO
  • END PARALLEL DO

15
Loop Interchange
  • Working with the direction matrix:
  • Move loops whose columns contain only "=" entries
    into the outermost positions, parallelize them,
    and remove their columns from the matrix.
  • Move the loop with the most "<" entries into the
    next outermost position and sequentialize it;
    eliminate its column and any rows representing
    dependences it carries.
  • Repeat from step 1 (a worked sketch follows).
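
A small worked sketch (our own example): the nest below has the single
direction vector (=, <). The I column contains only "=" entries, so
step 1 moves I outermost and parallelizes it; the J column holds the
"<", so J stays sequential inside.

      DO I = 1, N                ! direction matrix:   I  J
         DO J = 2, N             !                     =  <
            A(I, J) = A(I, J-1) + B(I, J)
         ENDDO
      ENDDO

      PARALLEL DO I = 1, N       ! I carries no dependence: parallel
         DO J = 2, N             ! J carries the dependence: sequential
            A(I, J) = A(I, J-1) + B(I, J)
         ENDDO
      END PARALLEL DO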

16
Loop Reversal
  • DO I = 2, N+1
  •   DO J = 2, M+1
  •     DO K = 1, L
  •       A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
  •     ENDDO
  •   ENDDO
  • ENDDO

Direction matrix:   I  J  K
                    =  <  >
                    <  =  >
17
Loop Reversal
  • DO K = L, 1, -1
  •   PARALLEL DO I = 2, N+1
  •     PARALLEL DO J = 2, M+1
  •       A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
  •     END PARALLEL DO
  •   END PARALLEL DO
  • ENDDO
  • Reversal of the K loop increases the range of
    options available for loop selection heuristics.

18
Pipeline Parallelism
  • Fortran command: DOACROSS
  • Useful where full loop parallelism is not available
  • High synchronization costs
  • DO I = 2, N-1
  •   DO J = 2, N-1
  •     A(I, J) = .25 * (A(I-1, J) + A(I, J-1) +
  •                      A(I+1, J) + A(I, J+1))
  •   ENDDO
  • ENDDO

19
Pipeline Parallelism
  • DOACROSS I = 2, N-1
  •   POST (EV(1))
  •   DO J = 2, N-1
  •     WAIT (EV(J-1))
  •     A(I, J) = .25 * (A(I-1, J) + A(I, J-1) +
  •                      A(I+1, J) + A(I, J+1))
  •     POST (EV(J))
  •   ENDDO
  • ENDDO
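
For comparison, the same pipeline can be written with the doacross
dependences introduced in OpenMP 4.5; this is our analogue of the
POST/WAIT events above, not the deck's own notation. DEPEND(SINK: I-1, J)
makes iteration (I, J) wait until row I-1 has finished column J, which
covers the flow dependence on A(I-1, J); the within-row dependences are
satisfied because each row runs sequentially on one thread.

!$OMP PARALLEL DO ORDERED(2) PRIVATE(J)
      DO I = 2, N-1
         DO J = 2, N-1
!$OMP ORDERED DEPEND(SINK: I-1, J)
            A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
!$OMP ORDERED DEPEND(SOURCE)
         ENDDO
      ENDDO
!$OMP END PARALLEL DO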

20
Pipeline Parallelism
[figure not included in the transcript]
21
Pipeline Parallelism
  • DOACROSS I = 2, N-1
  •   POST (EV(1))
  •   K = 0
  •   DO J = 2, N-1, 2
  •     K = K + 1
  •     WAIT (EV(K))
  •     DO JJ = J, MIN(J+1, N-1)
  •       A(I, JJ) = .25 * (A(I-1, JJ) + A(I, JJ-1) +
  •                         A(I+1, JJ) + A(I, JJ+1))
  •     ENDDO
  •     POST (EV(K+1))
  •   ENDDO
  • ENDDO

22
Pipeline Parallelism
[figure not included in the transcript]