Title: Runtime Parallelization: Its Time Has Come
1. Run-time Parallelization: Its Time Has Come
- Lawrence Rauchwerger
- Presented by Burcea Mihai
 
2. Abstract
- Parallel programs gain most of their performance from loop parallelization
- Access patterns that are insufficiently defined at compile time limit how much of that benefit can be exploited
- Remedy: complement static analysis with techniques based on run-time data dependence analysis

3. Agenda
- Loop parallelization: description, techniques, issues
- Approaches to run-time parallelization
- Run-time techniques for partially parallel loops
- Run-time techniques for fully parallel loops
- A framework for implementing run-time testing
 
4. Automatic Parallelization: Issues
- Three ways to achieve parallel performance:
  - Good parallel algorithms
  - A standard parallel language for portable parallel programming
  - Compilers that parallelize sequential programs and optimize parallel programs

5. Automatic Parallelization: Issues (cont'd)
- Options 1 and 2 are difficult to achieve (why?)
- Best option: a parallelizing compiler that optimizes both legacy and modern code
- It should perform:
  - Parallelism detection
  - Parallelism exploitation
 
6. Run-time Parallelization
- First task: data dependence analysis; its goal is to decide whether loop iterations can execute independently
- Two classes of applications:
  - Regular programs: static memory access patterns
  - Irregular programs: dynamic memory access patterns
 
7. Run-time Parallelization: Why?
- 50% of applications are irregular
- For irregular applications, and for access patterns that are complex even statically, parallelizing compilers don't do a good job
- Solution: optimize at run time, when more information is available

8. Run-time Parallelization
- Examples of run-time-only information:
  - Input-dependent / dynamic data distributions
  - Memory accesses guarded by run-time guard conditions
  - Subscript expressions
- We discuss techniques for optimizing loop parallelism in Fortran programs on shared-memory architectures

9. Loop Parallelization: Concepts
- Doall loops: fully parallel
- Partially parallel loops: some synchronization needed
- Doacross (pipelined) loops
 
10. Data Dependence Types
- Flow (read after write), e.g. producer-consumer
- Anti (write after read)
- Output (write after write)
 
11. Dependence Types: Examples
- Anti and output dependences should be removed before the loop can be executed in parallel, as in the examples below
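
For illustration, a minimal C sketch of the three dependence types (the loop bodies are invented for this writeup, not a specific application kernel):

    #include <stddef.h>

    /* Flow (read after write): iteration i consumes what i-1 produced;
       the loop is serialized as written. */
    void flow_dep(double *a, size_t n) {
        for (size_t i = 1; i < n; i++)
            a[i] = a[i - 1] + 1.0;
    }

    /* Anti (write after read): iteration i reads a[i+1] before
       iteration i+1 overwrites it; removable by renaming/copying. */
    void anti_dep(double *a, size_t n) {
        for (size_t i = 0; i + 1 < n; i++)
            a[i] = a[i + 1] * 2.0;
    }

    /* Output (write after write): every iteration writes the scalar t;
       removable by giving each processor a private copy of t. */
    void output_dep(double *a, const double *b, size_t n) {
        double t;
        for (size_t i = 0; i < n; i++) {
            t = b[i];
            a[i] = t * t;
        }
    }
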
12. Parallelism-Enabling Transformations
- Privatization: create private copies on each processor
- Reduction: variables updated only through associative operations (e.g. a running sum)
- Reduction parallelization implies:
  - Recognizing the reduction variable, and
  - Parallelizing the reduction operation (both transformations sketched below)
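
A sketch of both transformations in C with OpenMP, standing in for the talk's Fortran setting (privatized and reduced are illustrative names):

    /* Privatization: t is written before it is read in every iteration,
       so each iteration keeps its own copy and the anti/output
       dependences on t disappear. */
    void privatized(double *a, const double *b, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double t = b[i];       /* t is now private to the iteration */
            a[i] = t * t;
        }
    }

    /* Reduction: sum is updated only through an associative operation,
       so partial sums can be formed in parallel and combined afterward. */
    double reduced(const double *a, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }
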
 
13. Reduction Techniques
- For commutative operations: transform the do loop into a doall and enclose the access in a critical region
- Cons: not always scalable; requires synchronization, which can be expensive
- Scalable solution: privatization plus an interprocessor reduction phase (both variants sketched below)
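
A sketch of the two strategies for an irregular histogram update, again in C with OpenMP (hist_critical/hist_privatized are illustrative names):

    #include <stdlib.h>

    /* Unscalable variant: a doall whose shared update is guarded by a
       critical region; correct, but every update synchronizes. */
    void hist_critical(int *hist, int nbins, const int *bin, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            hist[bin[i]]++;
        }
    }

    /* Scalable variant: each thread accumulates into a private copy and
       the copies are merged in an interprocessor reduction phase. */
    void hist_privatized(int *hist, int nbins, const int *bin, int n) {
        #pragma omp parallel
        {
            int *local = calloc(nbins, sizeof(int));  /* private copy */
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                local[bin[i]]++;
            #pragma omp critical                      /* merge phase */
            for (int b = 0; b < nbins; b++)
                hist[b] += local[b];
            free(local);
        }
    }
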
14. Partially Parallel Loops
- Use DAGs; cpl = critical path length
- If cpl = number of iterations, the loop is sequential
- Otherwise it is a partially parallel loop
- General technique for parallel loops:
  - Remove anti and output dependences
  - Execute in parallel, using cross-iteration synchronization to enforce flow dependences

15. Partially Parallel Loops
- Sequential loops with a constant dependence distance can overlap segments of their iterations
- Statically analyzable doacross loops are exploited with post and await instructions (sketched below)
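
A doacross sketch assuming a single constant dependence distance D; post/await is modeled with one atomic flag per iteration (the done array, zero-initialized by the caller, is an assumption of this sketch):

    #include <stdatomic.h>

    #define D 4   /* constant dependence distance (illustrative) */

    /* done[i] is posted once iteration i completes. With a round-robin
       schedule every thread runs its iterations in increasing order, so
       the lowest unfinished iteration can always proceed and the spin
       below cannot deadlock. */
    void doacross(double *a, int n, atomic_int *done) {
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < n; i++) {
            if (i >= D)
                while (!atomic_load(&done[i - D]))
                    ;                                 /* await(i - D) */
            a[i] = (i >= D ? a[i - D] : 0.0) + 1.0;   /* flow dep at distance D */
            atomic_store(&done[i], 1);                /* post(i) */
        }
    }
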
16. Partially Parallel Loops
- Partition the iteration space into disjoint sets of mutually independent iterations (wavefronts)
- Execute the wavefronts as a chain of doalls separated by global synchronizations (sketched below)
- Alternatively, use independent threads of dependent iterations
- These methods require very detailed information about the data dependences in the loop
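
A sketch of the chain-of-doalls execution, assuming an inspector has already packed the iterations of each wavefront into wf_iter, with wf_start[w]..wf_start[w+1] delimiting wavefront w (illustrative names):

    /* Within a wavefront all iterations are mutually independent, so
       each wavefront runs as a doall; the implicit barrier at the end
       of "omp for" is the global synchronization between wavefronts. */
    void run_wavefronts(int nwf, const int *wf_start, const int *wf_iter,
                        void (*body)(int iteration)) {
        for (int w = 0; w < nwf; w++) {
            #pragma omp parallel for
            for (int k = wf_start[w]; k < wf_start[w + 1]; k++)
                body(wf_iter[k]);
        }
    }
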
17. Conclusions on These Techniques
- Some rely on statically defined pattern matching, which has limited recognition capabilities
- Others require information that is unavailable at compile time (the access pattern), as in irregular applications
- Moral: we should use run-time techniques
 
18. General Run-Time Techniques
- For parallelizing both doall loops and partially parallel loops
- Inspector/executor methods: extract an inspector loop that traverses the access pattern, without modifying the shared data, before the actual computation takes place (skeleton below)
- Speculative methods: inspect the shared-memory accesses during the parallel execution of the loop under test
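
A minimal inspector/executor sketch for a loop of the form a[idx[i]] = f(i): the inspector reads only the subscript array, never the shared data; here it simply checks for duplicate subscripts and picks between two loop versions (a simplification of the methods on the following slides):

    #include <stdlib.h>

    /* idx values must lie in [0, m). */
    void inspect_and_execute(double *a, int m, const int *idx, int n,
                             double (*f)(int)) {
        /* inspector: side-effect-free traversal of the access pattern */
        char *seen = calloc(m, 1);
        int fully_parallel = 1;
        for (int i = 0; i < n && fully_parallel; i++) {
            if (seen[idx[i]])
                fully_parallel = 0;    /* output dependence detected */
            seen[idx[i]] = 1;
        }
        free(seen);

        /* executor: two-version loop */
        if (fully_parallel) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[idx[i]] = f(i);
        } else {
            for (int i = 0; i < n; i++)
                a[idx[i]] = f(i);
        }
    }
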
19. RTT for Partially Parallel Loops
- Simple methods generate two-version loops
- Other methods:
- 1. Methods using critical sections
  - A) Add an iteration to the current wavefront if no data accessed in that iteration is accessed by any lower unassigned iteration
  - Find the lowest unassigned iteration that accesses each array element using atomic compare-and-swap operations and a shadow array (one round of this is sketched below)
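
A sketch of one round of this method, assuming a single shared-array access per iteration; shadow must be reset to INT_MAX before each round, and assigned marks iterations already placed in earlier wavefronts (names are illustrative):

    #include <stdatomic.h>
    #include <limits.h>

    /* Atomic "min": install iter in *slot if it is lower than the
       current value; a failed CAS reloads cur, so the loop retries. */
    static void cas_min(atomic_int *slot, int iter) {
        int cur = atomic_load(slot);
        while (iter < cur &&
               !atomic_compare_exchange_weak(slot, &cur, iter))
            ;
    }

    /* Unassigned iterations claim their element in the shadow array,
       then join the wavefront only if they are the lowest unassigned
       iteration touching that element. */
    void assign_wavefront(const int *idx, const int *assigned, int n,
                          atomic_int *shadow, int *in_wavefront) {
        #pragma omp parallel
        {
            #pragma omp for                 /* claim phase */
            for (int i = 0; i < n; i++)
                if (!assigned[i])
                    cas_min(&shadow[idx[i]], i);
            #pragma omp for                 /* test phase (after barrier) */
            for (int i = 0; i < n; i++)
                in_wavefront[i] = !assigned[i] &&
                                  atomic_load(&shadow[idx[i]]) == i;
        }
    }
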
20. RTT for Partially Parallel Loops
- Pros: small memory requirements
- Cons: may only behave well for small values of cpl and for access patterns without hot spots
- B) Parallel inspector:
  - Allocate global space to remove all anti and output dependences
  - Build a dependence graph, mapping all memory accesses to the array
  - Use synchronization to enforce flow dependences
 
21. RTT for Partially Parallel Loops
- C) Inspector with a private phase and a merging phase:
  - Private phase: chunk the loop among CPUs; each CPU builds a list of the accesses in its assigned iterations
  - Merging phase: the per-location lists are linked across processors by a global algorithm
  - The scheduler/executor then performs doacross parallelization
 
22. RTT for Partially Parallel Loops
- 2. Methods for loops without output dependences
- A) Most of them transform the original source loop into an inspector:
  - The inspector performs run-time data dependence analysis and constructs a schedule
  - The executor executes the schedule
  - The original source loop must not contain output dependences!
  - The inspector performs a sequential topological sort of the accesses in the loop (sketched below)
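
A sketch of such an inspector, assuming one read a[rd[i]] and one write a[wr[i]] per iteration; it conservatively treats every pair of accesses to the same element as a dependence, which can only over-serialize:

    /* Iteration i is scheduled one wavefront after the latest earlier
       iteration touching either of its elements. last_wf[] (size m,
       initialized to -1) remembers, per element, the highest wavefront
       that accessed it so far. */
    int build_wavefronts(const int *rd, const int *wr, int n,
                         int *last_wf, int *wf) {
        int nwf = 0;
        for (int i = 0; i < n; i++) {
            int prev = last_wf[rd[i]] > last_wf[wr[i]]
                     ? last_wf[rd[i]] : last_wf[wr[i]];
            wf[i] = prev + 1;
            last_wf[rd[i]] = last_wf[wr[i]] = wf[i];
            if (wf[i] + 1 > nwf)
                nwf = wf[i] + 1;
        }
        return nwf;   /* number of wavefronts = critical path length */
    }
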
23. RTT for Partially Parallel Loops
- Improvements come from parallelizing the topological sort:
  - I. Assign iterations to CPUs in a wrapped (round-robin) manner and use busy-waiting
  - II. Sectioning: requires synchronization (a barrier per CPU)
  - III. Bootstrapping: the inspector itself is parallelized, using the sectioning method

24. RTT for Partially Parallel Loops
- 3. A scalable method for loop parallelization:
  - Inspector phase
  - Scheduler: generates wavefronts
  - Executor: executes the wavefronts
- The scheduler and executor can work in tandem or in interleaved mode (or generate independent threads)

25. RTT for Fully Parallel (doall) Loops
- Doall loops have more scalability potential
- We describe the LRPD test: it detects fully parallel loops and can validate privatization and reduction parallelization
- It detects the presence of cross-iteration dependences but does not identify them
- It need only be applied to scalars and arrays that can't be analyzed at compile time

26. The LRPD Test
- An important source of ambiguity: the run-time equivalent of dead code
- The LRPD test checks only the dynamic data dependences caused by actual cross-iteration flow
- This is done via dynamic dead reference elimination

27. The Lazy (Value-based) Privatizing doall Test (LPD)
- 1. Marking phase:
  - Use shadow arrays Aw, Ar, and Anp (written, read, not privatizable)
  - Mark reads and writes accordingly; count the writes
- 2. Analysis phase:
  - Compute tw(A), the total number of writes, and tm(A), the total number of marked elements
  - If any(Aw ∧ Ar), the loop is not a doall
  - If tw(A) = tm(A), the loop is a doall
  - If any(Aw ∧ Anp), the loop is not a doall
  - Otherwise, the loop can be transformed into a doall by privatizing the shared array (a simplified sketch follows)
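
A much-simplified sketch of the test over a recorded access trace. The real test instruments the loop body instead of consuming a trace, and its marking rules are finer-grained: in this simplification Ar and Anp coincide, while the full LPD marks them lazily, only for reads whose value is actually used (next slide):

    #include <stdlib.h>

    enum { NOT_DOALL, DOALL, PRIV_DOALL };

    /* Access k touches element elem[k] of A in iteration iter[k], in
       original (sequential) execution order; is_wr[k] marks writes. */
    int lpd_test(int K, const int *iter, const int *elem,
                 const char *is_wr, int m) {
        char *Aw = calloc(m, 1), *Ar = calloc(m, 1), *Anp = calloc(m, 1);
        int  *wit = malloc(m * sizeof(int));  /* last iteration to write j */
        for (int j = 0; j < m; j++) wit[j] = -1;
        long tw = 0;

        for (int k = 0; k < K; k++) {         /* marking phase */
            int j = elem[k];
            if (is_wr[k]) {
                Aw[j] = 1; tw++; wit[j] = iter[k];
            } else if (wit[j] != iter[k]) {   /* read not covered by a */
                Ar[j] = Anp[j] = 1;           /* same-iteration write  */
            }
        }

        long tm = 0;                          /* analysis phase */
        int rw = 0, rnp = 0;
        for (int j = 0; j < m; j++) {
            tm  += Aw[j];
            rw  |= Aw[j] && Ar[j];
            rnp |= Aw[j] && Anp[j];
        }
        free(Aw); free(Ar); free(Anp); free(wit);

        if (rw)       return NOT_DOALL;   /* cross-iteration flow       */
        if (tw == tm) return DOALL;       /* every element written once */
        if (rnp)      return NOT_DOALL;   /* privatization cannot help  */
        return PRIV_DOALL;                /* doall after privatizing A  */
    }
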
28. LPD: Dynamic Dead Reference Elimination
- Postpone the marking of Ar and Anp until the value of the shared variable is actually used
- A dynamically dead read reference is a read access of a shared variable that both:
  - Does not contribute to the computation of any other shared variable, and
  - Does not control (predicate) the references to other shared variables

29. LRPD: Extending LPD for Reduction Validation
- Compile-time pattern matching is weak:
  - Data dependence analysis for reduction detection can't be done statically in the presence of input-dependent access patterns
  - Syntactic pattern matching can't identify all potential reduction variables (e.g. subscripted subscripts)

30. LRPD: Extending LPD for Reduction Validation
- Use an extra shadow array, Anx, to flag invalid reduction variables
- A variable is not a valid reduction variable if it was either defined (written) or used (read) outside the reduction statement

31. Complexity and Implementation Issues
- Private shadow structures
- A processor-wise version of the LRPD test
- Complexity:
  - Time: linear in the total iteration count of the loop and in the number of elements of the shared array
  - Space: linear in the number of elements of the shared array

32. Inspector/Executor vs. Speculative
- Inspector/executor needs a proper inspector loop to be extracted
- For many applications that is not possible
- Solution: speculate
- But speculation must generate additional code for saving and restoring state (skeleton below)
- Both methods imply two-version loops: the serial version is executed if the loop turns out not to be a doall
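
A skeleton of the speculative two-version scheme; parallel_and_test stands for the generated speculative loop plus the run-time test's verdict, and sequential for the original serial loop (both names are assumptions of this sketch):

    #include <stdlib.h>
    #include <string.h>

    /* parallel_and_test runs the speculative parallel loop while the
       shadow arrays are marked, and returns nonzero iff the test
       passed; on failure, state is restored and the loop re-executes
       serially. */
    void speculate(double *a, int m,
                   int  (*parallel_and_test)(double *a, int m),
                   void (*sequential)(double *a, int m)) {
        double *backup = malloc(m * sizeof *a);
        memcpy(backup, a, m * sizeof *a);        /* save state */

        if (!parallel_and_test(a, m)) {          /* speculation failed */
            memcpy(a, backup, m * sizeof *a);    /* restore state */
            sequential(a, m);                    /* re-execute serially */
        }
        free(backup);
    }
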
33. Experimental Results [figure]

34. Experimental Results [figure]

35. Framework for Implementing Run-time Testing
- At compile time:
  - A) Perform cost/benefit analysis to determine how the loop should be treated
  - B) Decide on loop transformations based on A)
  - C) Generate code accordingly
 
36. Framework for Implementing Run-time Testing
- At run time:
  - A) Checkpoint if necessary
  - B) Execute the run-time technique in parallel
  - C) Collect statistics
 
37. Related Work
- 1. Race detection:
  - Race conditions and access anomalies can be detected statically, from an execution trace, or at run time
  - Expensive in terms of:
    - Memory requirements
    - Execution-time overhead
- 2. Optimistic execution and the virtual time concept:
  - Applicable to databases and discrete event simulation rather than loop parallelization

38. Future Directions
- There is a need for new techniques
- Make more aggressive use of run-time optimizations
- We need more experimental, statistical data about the parallel profiles of dynamic codes
- Architectural support will help reduce the run-time overhead