Title: Coarse-Grain Parallelism
1. Coarse-Grain Parallelism
Chapter 6 of Allen and Kennedy
2. Introduction
- Previously, our transformations targeted vector and superscalar architectures.
- In this lecture, we consider transformations for symmetric multiprocessor (SMP) machines.
- The difference between these transformations tends to be one of granularity.
3. Review
- SMP machines have multiple processors all accessing a central memory.
- The processors are independent and can run separate processes.
- Starting processes and synchronization between processes is expensive.
4. Synchronization
- A basic synchronization element is the barrier.
- A barrier in a program forces all processes to reach a certain point before execution continues.
- Bus contention can cause slowdowns.
5. Single Loops
- The analog of scalar expansion is privatization.
- Temporaries can be given separate namespaces for each iteration.

   DO I = 1,N
S1   T = A(I)
S2   A(I) = B(I)
S3   B(I) = T
   ENDDO

   PARALLEL DO I = 1,N PRIVATE t
S1   t = A(I)
S2   A(I) = B(I)
S3   B(I) = t
   ENDDO
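A minimal Python sketch of this transformation (names are illustrative): once each iteration owns a private temporary, the iterations of the swap loop can execute in any order and still match the sequential result.

```python
def swap_sequential(A, B):
    # Original loop: T = A(I); A(I) = B(I); B(I) = T
    A, B = A[:], B[:]
    for i in range(len(A)):
        T = A[i]
        A[i] = B[i]
        B[i] = T
    return A, B

def swap_privatized(A, B, order):
    # PARALLEL DO with PRIVATE t: each iteration owns its temporary,
    # so iterations may execute in any order (simulated by `order`).
    A, B = A[:], B[:]
    for i in order:
        t = A[i]          # private t: local to this iteration
        A[i] = B[i]
        B[i] = t
    return A, B

# Any permutation of the iterations gives the sequential answer.
assert swap_privatized([1, 2, 3], [7, 8, 9], [2, 0, 1]) == \
       swap_sequential([1, 2, 3], [7, 8, 9])
```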
6. Privatization
- Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x.
- Privatizability can be stated as a data-flow problem.
- We can also declare a variable x private if its SSA graph doesn't contain a phi function at the loop entry.
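For a straight-line loop body, the test above reduces to checking for upward-exposed uses; a small Python sketch (a simplification of the general data-flow problem, which runs over the control-flow graph):

```python
def privatizable(body, x):
    # Straight-line version of the data-flow test: x is privatizable if
    # no use of x is upward-exposed, i.e. every use of x is preceded by
    # a definition of x within the same iteration.
    defined = False
    for defs, uses in body:        # statement = (set of defs, set of uses)
        if x in uses and not defined:
            return False           # upward-exposed use of x
        if x in defs:
            defined = True
    return True

# Body of the swap loop: T = A(I); A(I) = B(I); B(I) = T
body = [({'T'}, {'A'}), ({'A'}, {'B'}), ({'B'}, {'T'})]
assert privatizable(body, 'T')        # T defined (S1) before its use (S3)
assert not privatizable(body, 'A')    # A used (S1) before its definition (S2)
```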
7. Array Privatization
- We need to privatize array variables as well.
- For an iteration of the outer loop, the upwards-exposed variables are those used in the loop body before being defined in that iteration.

   DO I = 1,100
S0   T(1) = X
L1   DO J = 2,N
S1     T(J) = T(J-1) + B(I,J)
S2     A(I,J) = T(J)
     ENDDO
   ENDDO

For this fragment, T(1) is the only upwards-exposed variable in L1, and it is defined by S0 in the same iteration of I, so T is privatizable.
8. Array Privatization
- Using this analysis, we get the following code:

   PARALLEL DO I = 1,100 PRIVATE t
S0   t(1) = X
L1   DO J = 2,N
S1     t(J) = t(J-1) + B(I,J)
S2     A(I,J) = t(J)
     ENDDO
   ENDDO
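A small Python simulation (illustrative sizes and data) of why the privatized version is safe: with a private t per outer iteration, the I loop can execute in any order.

```python
def orig_nest(B, X):
    # DO I: T(1) = X; DO J: T(J) = T(J-1)+B(I,J); A(I,J) = T(J)
    n = len(B[0])
    A = [[0.0] * n for _ in B]
    T = [0.0] * n                    # shared temporary array
    for i in range(len(B)):
        T[0] = X
        for j in range(1, n):
            T[j] = T[j - 1] + B[i][j]
            A[i][j] = T[j]
    return A

def priv_nest(B, X, order):
    # PARALLEL DO I ... PRIVATE t: each outer iteration gets its own t,
    # so the I loop may execute in any order (simulated by `order`).
    n = len(B[0])
    A = [[0.0] * n for _ in B]
    for i in order:
        t = [0.0] * n                # private copy of the temporary
        t[0] = X
        for j in range(1, n):
            t[j] = t[j - 1] + B[i][j]
            A[i][j] = t[j]
    return A

B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
assert priv_nest(B, 10.0, [1, 0]) == orig_nest(B, 10.0)
```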
9. Loop Distribution
- Loop distribution eliminates carried dependences.
- Consequently, it often creates opportunities for outer-loop parallelism.
- We must add extra barriers to keep dependent loops from executing out of order, so the overhead may outweigh the parallel savings.
- Attempt other transformations before attempting this one.
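A sketch of distribution in Python (the statements are hypothetical, chosen so one carries a dependence and one does not): the serial recurrence is split into its own loop, and the remaining loop becomes order-independent.

```python
def combined(B):
    # One loop: S1 carries a dependence (A(I+1) uses A(I)); S2 does not.
    n = len(B)
    A = [0.0] * (n + 1)
    D = [0.0] * n
    for i in range(n):
        A[i + 1] = A[i] + B[i]      # S1: recurrence, inherently serial
        D[i] = B[i] * 2.0           # S2: independent iterations
    return A, D

def distributed(B):
    # After distribution: S1 keeps its own serial loop; once it finishes
    # (a barrier on a real SMP), the S2 loop may run fully in parallel
    # (simulated here by running it in reverse order).
    n = len(B)
    A = [0.0] * (n + 1)
    D = [0.0] * n
    for i in range(n):
        A[i + 1] = A[i] + B[i]
    # --- barrier ---
    for i in reversed(range(n)):
        D[i] = B[i] * 2.0
    return A, D

assert distributed([1.0, 2.0, 3.0]) == combined([1.0, 2.0, 3.0])
```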
10. Alignment
- Many carried dependences are due to array alignment issues.
- If we can align all references, the dependences go away and parallelism is possible.

   DO I = 2,N
     A(I) = B(I) + C(I)
     D(I) = A(I-1) * 2.0
   ENDDO

   DO I = 1,N
     IF (I .GT. 1) A(I) = B(I) + C(I)
     IF (I .LT. N) D(I+1) = A(I) * 2.0
   ENDDO
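We can check in Python (array sizes and data are illustrative) that the guarded, aligned loop computes the same values as the original, while each A(I) is now defined and used in the same iteration.

```python
def orig_loop(B, C, n):
    # DO I = 2,N: A(I) = B(I)+C(I); D(I) = A(I-1)*2.0   (1-based)
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 2)
    for i in range(2, n + 1):
        A[i] = B[i] + C[i]
        D[i] = A[i - 1] * 2.0
    return A, D

def aligned_loop(B, C, n):
    # Shifted so A(I) is defined and used in the same iteration:
    # no cross-iteration dependence remains.
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 2)
    for i in range(1, n + 1):
        if i > 1:
            A[i] = B[i] + C[i]
        if i < n:
            D[i + 1] = A[i] * 2.0
    return A, D

B = [float(x) for x in range(8)]
C = [1.0] * 8
assert aligned_loop(B, C, 6) == orig_loop(B, C, 6)
```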
11. Alignment
- There are other ways to align the loop:

   DO I = 2,N
     J = MOD(I+N-4, N-1) + 2
     A(J) = B(J) + C(J)
     D(I) = A(I-1) * 2.0
   ENDDO

   D(2) = A(1) * 2.0
   DO I = 2,N-1
     A(I) = B(I) + C(I)
     D(I+1) = A(I) * 2.0
   ENDDO
   A(N) = B(N) + C(N)
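The MOD-based rotation is the less obvious of the two; a Python check (illustrative sizes) that it matches the original loop:

```python
def orig_loop(B, C, n):
    # DO I = 2,N: A(I) = B(I)+C(I); D(I) = A(I-1)*2.0   (1-based)
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 2)
    for i in range(2, n + 1):
        A[i] = B[i] + C[i]
        D[i] = A[i - 1] * 2.0
    return A, D

def rotated_loop(B, C, n):
    # J = MOD(I+N-4, N-1)+2 rotates each A assignment into the iteration
    # that consumes it, leaving only a loop-independent dependence.
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 2)
    for i in range(2, n + 1):
        j = (i + n - 4) % (n - 1) + 2
        A[j] = B[j] + C[j]
        D[i] = A[i - 1] * 2.0
    return A, D

B = [float(x) for x in range(9)]
C = [2.0] * 9
assert rotated_loop(B, C, 7) == orig_loop(B, C, 7)
```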
12. Alignment
- If an array is involved in a recurrence, then alignment isn't possible.
- If two dependences between the same statements have different dependence distances, then alignment doesn't work.
- We can fix the second case by replicating code:

   DO I = 1,N
     A(I+1) = B(I) + C
     X(I) = A(I+1) + A(I)
   ENDDO

   DO I = 1,N
     A(I+1) = B(I) + C
     ! Replicated statement
     IF (I .EQ. 1) THEN
       t = A(I)
     ELSE
       t = B(I-1) + C
     END IF
     X(I) = A(I+1) + t
   ENDDO
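A Python check (illustrative data) that the replicated loop computes the same X as the original, with the carried read of A(I) replaced by a recomputation:

```python
def orig_loop(B, c, n):
    # DO I = 1,N: A(I+1) = B(I)+C; X(I) = A(I+1)+A(I)
    A = [0.0] * (n + 2)
    X = [0.0] * (n + 1)
    for i in range(1, n + 1):
        A[i + 1] = B[i] + c
        X[i] = A[i + 1] + A[i]
    return A, X

def replicated_loop(B, c, n):
    # The carried read A(I) is replaced by recomputing the value it holds,
    # so only a loop-independent dependence on A(I+1) remains.
    A = [0.0] * (n + 2)
    X = [0.0] * (n + 1)
    for i in range(1, n + 1):
        A[i + 1] = B[i] + c
        if i == 1:
            t = A[i]            # A(1) keeps its pre-loop value (0 here)
        else:
            t = B[i - 1] + c    # replicated computation of A(I)
        X[i] = A[i + 1] + t
    return A, X

B = [0.0, 1.0, 2.0, 3.0, 4.0]
assert replicated_loop(B, 5.0, 4) == orig_loop(B, 5.0, 4)
```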
13. Alignment
Theorem: Alignment, replication, and statement reordering are sufficient to eliminate all carried dependences in a single loop containing no recurrence, in which the distance of each dependence is a constant independent of the loop index.
- We can establish this constructively.
- Let G = (V, E, δ) be a weighted graph, where each v ∈ V is a statement and δ(v1, v2) is the dependence distance between v1 and v2. Let o : V → Z give the offset of each vertex.
- G is said to be carry-free if o(v1) + δ(v1, v2) = o(v2) for every edge (v1, v2).
14. Alignment Procedure

procedure Align(V, E, δ, o)
  choose a vertex v0 ∈ V; o(v0) := 0
  W := {v0}
  while W is not empty
    remove an element v from W
    for each edge (w, v) ∈ E
      if w has no offset yet
        o(w) := o(v) - δ(w, v); W := W ∪ {w}
      else if o(w) ≠ o(v) - δ(w, v)
        create a new vertex w′
        replace (w, v) with (w′, v)
        replicate all edges into w onto w′
        o(w′) := o(v) - δ(w, v); W := W ∪ {w′}
    for each edge (v, w) ∈ E
      if w has no offset yet
        o(w) := o(v) + δ(v, w); W := W ∪ {w}
      else if o(w) ≠ o(v) + δ(v, w)
        create a new vertex v′
        replace (v, w) with (v′, w)
        replicate all edges into v onto v′
        o(v′) := o(w) - δ(v, w); W := W ∪ {v′}
end Align
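A simplified Python sketch of the offset-propagation core of this procedure, assuming no conflicting offsets arise (so the replication branch is never needed); it asserts that the resulting offsets are carry-free.

```python
from collections import deque

def align_offsets(vertices, edges):
    # Propagate offsets across dependence edges so that
    # o[v] + delta == o[w] holds for every edge (v, w, delta).
    succ, pred = {}, {}
    for v, w, d in edges:
        succ.setdefault(v, []).append((w, d))
        pred.setdefault(w, []).append((v, d))
    o = {vertices[0]: 0}
    work = deque([vertices[0]])
    while work:
        v = work.popleft()
        for w, d in succ.get(v, []):     # edges leaving v
            if w not in o:
                o[w] = o[v] + d
                work.append(w)
        for w, d in pred.get(v, []):     # edges entering v
            if w not in o:
                o[w] = o[v] - d
                work.append(w)
    return o

# S1 -> S2 at distance 1, S1 -> S3 at distance 2
edges = [("S1", "S2", 1), ("S1", "S3", 2)]
o = align_offsets(["S1", "S2", "S3"], edges)
assert all(o[v] + d == o[w] for v, w, d in edges)    # carry-free
```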
15. Loop Fusion
- Loop distribution was a method for separating parallel parts of a loop.
- Our solution attempted to find the maximal loop distribution.
- The maximal distribution often finds parallelizable components too small for efficient parallelization.
- Two obvious solutions:
  - Strip mine large loops to create larger granularity.
  - Perform maximal distribution, then fuse parallelizable loops back together.
16. Fusion Safety
Definition: A loop-independent dependence between statements S1 and S2 in loops L1 and L2 respectively is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the opposite direction.

   DO I = 1,N
S1   A(I) = B(I) + C
   ENDDO
   DO I = 1,N
S2   D(I) = A(I+1) + E
   ENDDO

   DO I = 1,N
S1   A(I) = B(I) + C
S2   D(I) = A(I+1) + E
   ENDDO
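A Python illustration (illustrative data) that this dependence is fusion-preventing: fusing the two loops changes the values computed.

```python
def unfused(B, n, c, e):
    # DO I: A(I) = B(I)+C   then   DO I: D(I) = A(I+1)+E
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 1)
    for i in range(1, n + 1):
        A[i] = B[i] + c
    for i in range(1, n + 1):
        D[i] = A[i + 1] + e
    return D

def fused(B, n, c, e):
    # After fusing, S2 reads A(I+1) before iteration I+1 has written it:
    # the loop-independent dependence is now carried in the opposite
    # direction, so fusion changes the program's results.
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 1)
    for i in range(1, n + 1):
        A[i] = B[i] + c
        D[i] = A[i + 1] + e
    return D

B = [0.0, 1.0, 2.0, 3.0]
assert unfused(B, 3, 1.0, 0.0) != fused(B, 3, 1.0, 0.0)
```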
17. Fusion Safety
- We shouldn't fuse loops if the fusion will violate the ordering of the dependence graph.
- Ordering Constraint: Two loops can't be validly fused if there exists a path of loop-independent dependences between them containing a loop or statement not being fused with them.

Fusing L1 with L3 violates the ordering constraint: the fused node {L1, L3} would have to occur both before and after the node L2.
18. Fusion Profitability
Parallel loops should generally not be merged with sequential loops.
Definition: An edge between two statements in loops L1 and L2 respectively is said to be parallelism-inhibiting if, after merging L1 and L2, the dependence is carried by the combined loop.

   DO I = 1,N
S1   A(I+1) = B(I) + C
   ENDDO
   DO I = 1,N
S2   D(I) = A(I) + E
   ENDDO

   DO I = 1,N
S1   A(I+1) = B(I) + C
S2   D(I) = A(I) + E
   ENDDO
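Fusion here is still legal, but the fused loop carries the dependence, so its iterations can no longer run in arbitrary order; a quick Python check (illustrative data):

```python
def fused_loop(B, n, c, e, order):
    # Fused loop: S1 A(I+1) = B(I)+C; S2 D(I) = A(I)+E.  Fusion is legal,
    # but the dependence S1 -> S2 is now carried (distance 1), so the
    # iterations must still run in increasing order.
    A = [0.0] * (n + 2)
    D = [0.0] * (n + 1)
    for i in order:
        A[i + 1] = B[i] + c
        D[i] = A[i] + e
    return D

B = [0.0, 1.0, 2.0, 3.0]
serial = fused_loop(B, 3, 1.0, 0.0, [1, 2, 3])
scrambled = fused_loop(B, 3, 1.0, 0.0, [3, 1, 2])
assert serial != scrambled    # the carried dependence inhibits parallelism
```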
19. Typed Fusion
- We start off by classifying loops into two types: parallel and sequential.
- We next gather together all edges that inhibit efficient fusion, and call them bad edges.
- Given a loop dependency graph (V, E), we want to obtain a graph (V′, E′) by merging vertices of V subject to the following constraints:
  - Bad Edge Constraint: vertices joined by a bad edge aren't fused.
  - Ordering Constraint: vertices joined by a path containing a non-parallel vertex aren't fused.
20. Typed Fusion Procedure

procedure TypedFusion(G, T, type, B, t0)
  initialize all variables to zero
  set count[n] to the in-degree of node n
  initialize W with all nodes of in-degree zero
  while W is not empty
    remove an element n with type t from W
    if t = t0 then
      if maxBadPrev[n] = 0 then
        p := fused
      else
        p := next[maxBadPrev[n]]
      if p ≠ 0 then
        x := node[p]; num[n] := num[x]
        update_successors(n, t)
        fuse x and n and call the result n
      else
        create_new_fused_node(n)
        update_successors(n, t)
    else
      create_new_node(n)
      update_successors(n, t)
end TypedFusion
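The two constraints can be illustrated with a much-simplified greedy variant in Python (a sketch, not the book's algorithm): each parallel loop is fused into the most recent compatible group unless a bad edge or the ordering constraint forbids it.

```python
def path_escapes(succ, a, b, group):
    # True if some path a ->* b passes through a vertex outside `group`
    # (the ordering constraint); simplified DFS over the dependence graph.
    stack = [w for w in succ.get(a, []) if w not in group]
    seen = set()
    while stack:
        v = stack.pop()
        if v == b:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(succ.get(v, []))
    return False

def typed_fusion_sketch(nodes, edges, types, bad, t0):
    # Greedy sketch: visit nodes in topological order, fusing each node of
    # type t0 into the most recent group of type t0 when no bad edge and
    # no ordering violation prevents it.
    succ = {}
    for v, w in edges:
        succ.setdefault(v, []).append(w)
    groups = []
    for n in nodes:
        placed = False
        if types[n] == t0:
            for g in reversed(groups):
                if any(types[m] != t0 for m in g):
                    continue
                no_bad = all((m, n) not in bad and (n, m) not in bad for m in g)
                no_cut = not any(path_escapes(succ, m, n, g | {n}) for m in g)
                if no_bad and no_cut:
                    g.add(n)
                    placed = True
                break      # only the most recent t0 group is considered
        if not placed:
            groups.append({n})
    return groups

# L1 (par) -> L2 (seq) -> L3 (par): fusing L1 with L3 would place the
# fused loop both before and after L2, so L3 starts its own group.
types = {"L1": "par", "L2": "seq", "L3": "par"}
groups = typed_fusion_sketch(["L1", "L2", "L3"],
                             [("L1", "L2"), ("L2", "L3")], types, set(), "par")
assert groups == [{"L1"}, {"L2"}, {"L3"}]
```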
21. Typed Fusion Example
[Figures: the original loop graph; the graph annotated with (maxBadPrev, p) → num; the graph after fusing parallel loops; the graph after fusing sequential loops.]
22. Cohort Fusion
- Given an outer loop containing some number of inner loops, we want to be able to run some inner loops in parallel.
- We can do this as follows:
  - Run TypedFusion with B = {fusion-preventing edges, parallelism-inhibiting edges, and edges between a parallel loop and a sequential loop}.
  - Put a barrier at the end of each identified cohort.
  - Run TypedFusion again to fuse the parallel loops in each cohort.