Title: Runtime Parallelization: Its Time Has Come
1. Run-time Parallelization: Its Time Has Come
- Lawrence Rauchwerger
- Presented by Burcea Mihai
2. Abstract
- Parallel programs gain performance mainly through loop parallelization
- Insufficiently defined access patterns reduce the possibility of exploiting these benefits to the maximum
- The remedy: complement static analysis with techniques based on run-time data dependence analysis
3. Agenda
- Loop parallelization: description, techniques, issues
- Approaches to run-time parallelization
- Run-time techniques for partially parallel loops
- Run-time techniques for fully parallel loops
- A framework for implementing run-time testing
4. Automatic Parallelization Issues
- Three ways to achieve parallel performance:
- 1. Good parallel algorithms
- 2. A standard parallel language for portable parallel programming
- 3. Compilers that parallelize sequential programs and optimize parallel programs
5. Automatic Parallelization Issues (cont'd)
- Options 1 and 2 are difficult to achieve. Why?
- Best option: a parallelizing compiler that optimizes both legacy and modern code
- It should perform:
- Parallelism detection
- Parallelism exploitation
6. Run-time Parallelization
- First task: data dependence analysis. What is its goal?
- Two classes of applications:
- Regular programs: static memory access patterns
- Irregular programs: dynamic memory access patterns
7. Runtime Parallelization: Why?
- 50% of applications are irregular
- For irregular applications, and for access patterns that are complex even statically, parallelizing compilers don't do a good job
- Solution: optimize at run-time, when more information is available
8. Runtime Parallelization
- Examples:
- Input-dependent / dynamic data distribution
- Memory accesses guarded by run-time conditions
- Subscript expressions
- We discuss techniques for loop parallelism optimization in Fortran programs on shared-memory architectures
9. Loop Parallelization - Concepts
- Doall loops: all iterations are independent (contrasted with doacross loops in the sketch below)
- Partially parallel loops: some synchronization needed
- Doacross (pipelined) loops
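A minimal C sketch (an assumed example, not from the slides) contrasting the two extremes; the second loop carries a flow dependence from iteration i-1 to iteration i:

```c
/* Doall: every iteration is independent; any execution order is valid. */
void doall_example(int n, double *a, const double *b, const double *c) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Doacross: iteration i reads the value written by iteration i-1 (a flow
 * dependence), so parallel execution needs pipelining and cross-iteration
 * synchronization. */
void doacross_example(int n, double *a, const double *b) {
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];
}
```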
10. Data Dependency Types
- Flow (read after write), e.g. producer-consumer
- Anti (write after read)
- Output (write after write)
11. Dependency Types: Examples
- Anti and output dependences should be removed before the loop can be executed in parallel (reconstructed examples below)
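The slide's original code examples did not survive the transcript; this is a hedged reconstruction in C, labeled a) through c), with c) assumed to be a reduction so that the reference to "example c)" on the next slide still resolves:

```c
void dependence_examples(int n, double *a, const double *b, double *sum) {
    /* a) Flow dependence (read after write): iteration i reads a[i-1],
     * which iteration i-1 wrote. */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];

    /* b) Anti dependence (write after read): iteration i reads a[i+1],
     * which iteration i+1 then overwrites. */
    for (int i = 0; i < n - 1; i++)
        a[i] = a[i+1] * 2.0;

    /* c) Reduction: the scalar s carries flow, anti, and output
     * dependences across iterations, but '+' is associative, so the
     * loop can still be parallelized (see the next two slides). */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    *sum = s;
}
```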
12. Parallelism Enabling Transformations
- Privatization: create private copies on each processor (a sketch follows this list)
- Reduction: variables updated through associative operations (see example c) above)
- Reduction implies:
- Recognizing the reduction variable, and
- Parallelizing the reduction operation
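A minimal privatization sketch in C with OpenMP (an assumed example; `tmp` and the loop body are illustrative, not from the paper):

```c
/* If 'tmp' were shared, every iteration's write-then-read of it would
 * create anti and output dependences across iterations. Because each
 * iteration writes tmp before reading it, a private per-thread copy
 * removes those dependences and the loop becomes a doall. */
void privatization_example(int n, double *a, const double *b) {
    double tmp;
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < n; i++) {
        tmp = b[i] * b[i];   /* written before any read in the iteration */
        a[i] = tmp + 1.0;
    }
}
```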
13. Reduction Techniques
- For commutative operations: transform the do loop into a doall and enclose the access in a critical region
- Cons: not always scalable; requires synchronization, which can be expensive
- Scalable solution: privatization plus an interprocessor reduction phase (both approaches are contrasted below)
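A hedged C/OpenMP contrast of the two strategies (function names are assumptions):

```c
/* Non-scalable: the doall is legal, but every update is serialized
 * through a critical region, which is expensive under contention. */
double reduction_critical(int n, const double *a) {
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        sum += a[i];
    }
    return sum;
}

/* Scalable: each thread accumulates a private partial sum, and one
 * interprocessor reduction phase combines the partials at loop exit;
 * OpenMP's reduction clause implements this privatize-then-merge idea. */
double reduction_privatized(int n, const double *a) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```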
14. Partially Parallel Loops
- Use DAGs; cpl = critical path length
- If cpl = number of iterations, the loop is sequential
- Otherwise, the loop is partially parallel
- General technique for parallel loops:
- Remove anti and output dependences
- Execute in parallel, using cross-iteration synchronizations to enforce flow dependences
15. Partially Parallel Loops
- Sequential loops with a constant dependence distance can overlap segments of their iterations
- Statically analyzable doacross loops are exploited using post and await instructions (sketched below)
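A hedged sketch of post/await synchronization in C with OpenMP (all names and the workload are assumptions): only the short dependent update is serialized through the post/await chain, while the independent work in each iteration overlaps across threads.

```c
#include <math.h>

static double independent_work(double x) { return sin(x) * cos(x); }

void doacross_post_await(int n, double *a, const double *b,
                         volatile int *done /* zero-initialized, length n */) {
    done[0] = 1;  /* iteration 0's value, a[0], is assumed ready up front */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 1; i < n; i++) {
        double t = independent_work(b[i]);  /* overlaps across iterations */
        while (!done[i - 1]) { }            /* await(i-1): spin on the flag */
        a[i] = a[i - 1] + t;                /* the flow-dependent update */
        #pragma omp flush
        done[i] = 1;                        /* post(i): value is ready */
    }
}
```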
16. Partially Parallel Loops
- Their iteration space is partitioned into disjoint sets of mutually independent iterations (wavefronts)
- Execute the wavefronts as a chain of doalls separated by global synchronization (sketched below)
- Or use independent threads of dependent iterations
- These methods require very detailed information about the data dependences in the loop
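A hedged C sketch of wavefront execution (all names assumed): iterations with the same precomputed wavefront number are mutually independent and run as one doall, with the implicit barrier at the end of each parallel loop providing the global synchronization.

```c
void execute_wavefronts(int n, int num_wavefronts, const int *wf,
                        void (*body)(int) /* loop body for iteration i */) {
    for (int w = 0; w < num_wavefronts; w++) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            if (wf[i] == w)         /* iteration i belongs to wavefront w */
                body(i);
        /* implicit global barrier here, before wavefront w+1 starts */
    }
}
```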
17. Conclusion To These Techniques
- Some rely on statically defined pattern matching, which has reduced recognition capabilities
- Others require information that is not available at compile-time (the access pattern), as in irregular applications
- Moral: we should use run-time techniques
18. General Run-Time Techniques
- For parallelization of both doall loops and partially parallel loops
- Inspector/executor methods: extract an inspector loop that traverses the access pattern, without modifying the shared data, before the actual computation takes place (sketched below)
- Speculative methods: inspect the shared memory accesses during the parallel execution of the loop under test
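A hedged C sketch of the inspector idea for a loop of the form `a[idx[i]] += f(i)` (all names are assumptions): the inspector replays only the address computation and never touches the shared data.

```c
#include <stdlib.h>

/* Returns 1 if no element of the m-element shared array is touched by
 * two different iterations, i.e. the loop is fully parallel for this
 * particular input. */
int inspector(int n, int m, const int *idx) {
    char *touched = calloc(m, 1);   /* shadow array over the shared data */
    int parallel = 1;
    for (int i = 0; i < n && parallel; i++) {
        if (touched[idx[i]])
            parallel = 0;           /* cross-iteration collision found */
        touched[idx[i]] = 1;
    }
    free(touched);
    return parallel;                /* executor picks parallel or serial */
}
```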
19. Run-Time Techniques (RTT) for Partially Parallel Loops
- Simple methods generate two-version loops
- Other methods:
- 1. Methods using critical sections
- A) Add an iteration to the current wavefront if no data accessed in that iteration is accessed by any lower unassigned iteration
- Find the lowest unassigned iteration that accesses each array element, using atomic compare-and-swap instructions and a shadow array (sketched below)
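A hedged C11 sketch of the shadow-array idea (names assumed): threads race to record, per element, the lowest unassigned iteration that accesses it, keeping the minimum via compare-and-swap.

```c
#include <stdatomic.h>

/* shadow[elem] holds the lowest iteration number seen so far for that
 * element (initialized to a value larger than any iteration number). */
void record_access(atomic_int *shadow, int elem, int iter) {
    int cur = atomic_load(&shadow[elem]);
    while (iter < cur &&
           !atomic_compare_exchange_weak(&shadow[elem], &cur, iter)) {
        /* CAS failure reloads cur; retry while our iteration is lower */
    }
}

/* An iteration joins the current wavefront only if, for every element
 * it accesses, shadow[elem] equals its own iteration number. */
```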
20. RTT for Partially Parallel Loops
- Pros: small memory requirements
- Cons: may only behave well for small values of cpl and for access patterns without hot spots
- B) Parallel inspector:
- Allocate global space to remove all anti and output dependences
- Build a dependence graph, mapping all memory accesses to the array
- Use synchronization to enforce flow dependences
21. RTT for Partially Parallel Loops
- C) Inspector with a private phase and a merging phase:
- Private phase: chunk the loop among CPUs; each CPU builds a list of the accesses in its assigned iterations
- The per-location access lists are then linked across processors via a global algorithm
- The scheduler/executor performs doacross parallelization
22. RTT for Partially Parallel Loops
- 2. Methods for loops without output dependences
- A) Most of them transform the original source loop into an inspector
- The inspector does run-time data dependence analysis and constructs a schedule
- The executor executes the schedule
- The original source loop must not have output dependences!
- The inspector does a sequential topological sort of the accesses in the loop (sketched below)
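A hedged C sketch of the sequential topological sort (names assumed; simplified so that iteration i accesses the single element idx[i], where the real inspector takes the maximum over all elements an iteration touches):

```c
#include <stdlib.h>

/* wavefront[i] = 1 + wavefront of the last earlier access to idx[i];
 * iterations with equal wavefront numbers are mutually independent. */
void build_schedule(int n, int m, const int *idx, int *wavefront) {
    int *last = calloc(m, sizeof(int));   /* 0 means "never accessed yet" */
    for (int i = 0; i < n; i++) {
        int e = idx[i];
        wavefront[i] = last[e] + 1;       /* one level after the last access */
        last[e] = wavefront[i];
    }
    free(last);
}
```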
23. RTT for Partially Parallel Loops
- Improvements come from parallelizing the toposort:
- I. Assign iterations to CPUs in a wrapped manner; use busy waiting
- II. Sectioning: requires synchronization (a barrier for each CPU)
- III. Bootstrapping: the inspector itself is parallelized, using the sectioning method
24. RTT for Partially Parallel Loops
- 3. A scalable method for loop parallelization:
- Inspector phase
- Scheduler: generates wavefronts
- Executor: executes the wavefronts
- Scheduler and executor can work in tandem or in interleaved mode (or generate independent threads)
25. RTT For Fully Parallel (doall) Loops
- Doall loops have more scalability potential
- We describe the LRPD test: it detects fully parallel loops and can validate privatization and reduction parallelization
- It detects the presence of cross-iteration dependences but does not identify them
- It need only be applied to the scalars and arrays that cannot be analyzed at compile-time
26. The LRPD Test
- An important source of ambiguity: the run-time equivalent of dead code
- The LRPD test checks only the dynamic data dependences caused by the actual cross-iteration flow of values
- This is done via dynamic dead reference elimination
27. The Lazy (Value-based) Privatizing Doall Test (LPD)
- 1. Marking phase:
- Use shadow arrays Ar, Aw, and Anp
- Mark reads and writes accordingly; count the writes
- 2. Analysis phase (sketched after this list):
- Compute tw(A), the total number of writes, and tm(A), the number of marked elements in Aw
- If any(Aw(:) ∧ Ar(:)), the loop is not a doall
- Else, if tw(A) == tm(A), the loop is a doall
- Else, if any(Aw(:) ∧ Anp(:)), the loop is not a doall
- Otherwise, the loop can be transformed into a doall by privatizing the shared array
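A hedged C sketch of the analysis phase exactly as stated above, over shadow arrays filled during the marking phase (the array layout and all names are assumptions, not the paper's code):

```c
enum { NOT_DOALL, DOALL, DOALL_IF_PRIVATIZED };

int lpd_analysis(int m, const char *Aw, const char *Ar, const char *Anp,
                 long tw /* total writes counted during marking */) {
    long tm = 0;                               /* marked elements in Aw */
    for (int e = 0; e < m; e++) {
        if (Aw[e] && Ar[e]) return NOT_DOALL;  /* any(Aw ∧ Ar) */
        tm += Aw[e];
    }
    if (tw == tm) return DOALL;                /* no element written twice */
    for (int e = 0; e < m; e++)
        if (Aw[e] && Anp[e]) return NOT_DOALL; /* any(Aw ∧ Anp) */
    return DOALL_IF_PRIVATIZED;                /* privatize A, then doall */
}
```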
28. LPD: Dynamic Dead Reference Elimination
- Postpone the marking of Ar and Anp until the value of the shared variable is actually used
- A dynamically dead read reference is a read access of a shared variable that both:
- Does not contribute to the computation of any other shared variable, and
- Does not control (predicate) the references to other shared variables
29. LRPD: Extending LPD for Reduction Validation
- Compile-time pattern matching is weak:
- Data dependence analysis for reduction detection cannot be done statically in the presence of input-dependent access patterns
- Syntactic pattern matching cannot identify all potential reduction variables (e.g. subscripted subscripts)
30. LRPD: Extending LPD for Reduction Validation
- Use an extra shadow array, Anx, to flag invalid reduction variables (sketched below)
- A variable is not a valid reduction variable if it was either defined (written) or used (read) outside the reduction statement
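A hedged C sketch of the marking rule just stated (names assumed); instrumented accesses report whether they occur inside the candidate reduction statement:

```c
void mark_for_reduction(char *Anx, int elem, int in_reduction_stmt) {
    if (!in_reduction_stmt)
        Anx[elem] = 1;   /* defined or used outside the reduction statement */
}

/* Analysis: element elem may be parallelized as a reduction only if
 * Anx[elem] is still 0 after the marking phase. */
```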
31. Complexity and Implementation Issues
- Private shadow structures
- Processor-wise version of the LRPD test
- Complexity:
- Time: linear in the total iteration count of the loop and in the number of elements in the shared array
- Space: linear in the number of elements in the shared array
32. Inspector/Executor vs. Speculative
- Inspector/executor methods need a proper inspector loop to be extracted
- For many applications that is not possible
- Solution: speculative execution
- But it must generate additional code for saving and restoring state
- Both methods imply two-version loops: the serial version is executed if the loop is not a doall
33. Experimental Results
34. Experimental Results (cont'd)
35. Framework for Implementing Run-time Testing
- At compile time:
- A) Cost/benefit analysis to determine how the loop should be treated
- B) Decide upon any loop transformations, based on A)
- C) Generate code accordingly
36. Framework for Implementing Run-time Testing
- At run-time (a skeleton follows this list):
- A) Checkpoint, if necessary
- B) Execute the run-time technique in parallel
- C) Collect statistics
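A hedged C skeleton tying slides 32, 35, and 36 together (every helper name is an assumption, declared only to make the sketch self-contained):

```c
void checkpoint_shared_state(void);
void run_parallel_version_with_marking(void); /* speculative doall + marking */
int  lrpd_analysis_passed(void);
void restore_shared_state(void);
void run_serial_version(void);                /* second of the two versions */
void collect_statistics(void);

void run_two_version_loop(void) {
    checkpoint_shared_state();            /* a) save state the loop may modify */
    run_parallel_version_with_marking();  /* b) run the run-time technique */
    if (!lrpd_analysis_passed()) {        /* speculation failed: deps found */
        restore_shared_state();
        run_serial_version();
    }
    collect_statistics();                 /* c) feed future cost/benefit calls */
}
```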
37. Related Work
- 1. Race detection
- Race conditions and access anomalies can be detected statically, from an execution trace, or at run-time
- Expensive in terms of:
- Memory requirements
- Execution time overhead
- 2. Optimistic execution: the virtual time concept
- Applicable to databases and discrete event simulation rather than loop parallelization
38. Future Directions
- There is a need for new techniques
- Make more aggressive use of run-time optimizations
- We need more experimental, statistical data about the parallel profiles of dynamic codes
- Architectural support will help reduce the run-time overhead