Title: Runtime Parallelization: Its Time Has Come
1. Run-time Parallelization: Its Time Has Come
- Lawrence Rauchwerger
- Presented by Burcea Mihai
 
2. Abstract
- Parallel programs gain most of their performance from loop parallelization
- Access patterns that are insufficiently defined at compile time limit how much of that benefit can be exploited
- Remedy: complement static analysis with techniques based on run-time data dependence analysis

3. Agenda
- Loop parallelization: description, techniques, issues
- Approaches to run-time parallelization
- Run-time techniques for partially parallel loops
- Run-time techniques for fully parallel loops
- A framework for implementing run-time testing
 
4. Automatic Parallelization: Issues
- Three ways to achieve parallel performance:
  - Good parallel algorithms
  - A standard parallel language for portable parallel programming
  - Compilers that parallelize sequential programs and optimize parallel programs

5. Automatic Parallelization: Issues (cont'd)
- Options 1 and 2 are difficult to achieve (why?)
- Best option: a parallelizing compiler that optimizes both legacy and modern code
- It should perform:
  - Parallelism detection
  - Parallelism exploitation
 
6. Run-time Parallelization
- First task: data dependence analysis; its goal is to decide whether loop iterations can execute independently
- Two classes of applications:
  - Regular programs: static memory access patterns
  - Irregular programs: dynamic memory access patterns
 
7. Run-time Parallelization: Why?
- 50% of applications are irregular
- For irregular applications, and for access patterns that are complex even statically, parallelizing compilers don't do a good job
- Solution: optimize at run time, when more information is available

8. Run-time Parallelization
- Examples of run-time-only information:
  - Input-dependent / dynamic data distributions
  - Memory accesses guarded by run-time guard conditions
  - Subscript expressions
- We discuss techniques for optimizing loop parallelism in Fortran programs on shared-memory architectures

9. Loop Parallelization: Concepts
- Doall loops: fully parallel
- Partially parallel loops: some synchronization needed
- Doacross (pipelined) loops
 
10. Data Dependence Types
- Flow (read after write), e.g. producer-consumer
- Anti (write after read)
- Output (write after write)
 
11. Dependence Types: Examples
- Anti and output dependences should be removed before the loop can be executed in parallel, as in the examples below
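
For illustration, a minimal C sketch of the three dependence types (the loop bodies are invented for this writeup, not a specific application kernel):

    #include <stddef.h>

    /* Flow (read after write): iteration i consumes what i-1 produced;
       the loop is serialized as written. */
    void flow_dep(double *a, size_t n) {
        for (size_t i = 1; i < n; i++)
            a[i] = a[i - 1] + 1.0;
    }

    /* Anti (write after read): iteration i reads a[i+1] before
       iteration i+1 overwrites it; removable by renaming/copying. */
    void anti_dep(double *a, size_t n) {
        for (size_t i = 0; i + 1 < n; i++)
            a[i] = a[i + 1] * 2.0;
    }

    /* Output (write after write): every iteration writes the scalar t;
       removable by giving each processor a private copy of t. */
    void output_dep(double *a, const double *b, size_t n) {
        double t;
        for (size_t i = 0; i < n; i++) {
            t = b[i];
            a[i] = t * t;
        }
    }
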
12. Parallelism-Enabling Transformations
- Privatization: create private copies on each processor
- Reduction: variables updated only through associative operations (e.g. a running sum)
- Reduction parallelization implies:
  - Recognizing the reduction variable, and
  - Parallelizing the reduction operation (both transformations sketched below)
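
A sketch of both transformations in C with OpenMP, standing in for the talk's Fortran setting (privatized and reduced are illustrative names):

    /* Privatization: t is written before it is read in every iteration,
       so each iteration keeps its own copy and the anti/output
       dependences on t disappear. */
    void privatized(double *a, const double *b, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double t = b[i];       /* t is now private to the iteration */
            a[i] = t * t;
        }
    }

    /* Reduction: sum is updated only through an associative operation,
       so partial sums can be formed in parallel and combined afterward. */
    double reduced(const double *a, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }
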
 
13. Reduction Techniques
- For commutative operations: transform the do loop into a doall and enclose the access in a critical region
- Cons: not always scalable; requires synchronization, which can be expensive
- Scalable solution: privatization plus an interprocessor reduction phase (both variants sketched below)
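
A sketch of the two strategies for an irregular histogram update, again in C with OpenMP (hist_critical/hist_privatized are illustrative names):

    #include <stdlib.h>

    /* Unscalable variant: a doall whose shared update is guarded by a
       critical region; correct, but every update synchronizes. */
    void hist_critical(int *hist, int nbins, const int *bin, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            hist[bin[i]]++;
        }
    }

    /* Scalable variant: each thread accumulates into a private copy and
       the copies are merged in an interprocessor reduction phase. */
    void hist_privatized(int *hist, int nbins, const int *bin, int n) {
        #pragma omp parallel
        {
            int *local = calloc(nbins, sizeof(int));  /* private copy */
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                local[bin[i]]++;
            #pragma omp critical                      /* merge phase */
            for (int b = 0; b < nbins; b++)
                hist[b] += local[b];
            free(local);
        }
    }
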
14. Partially Parallel Loops
- Use DAGs; cpl = critical path length
- If cpl = number of iterations, the loop is sequential
- Otherwise it is a partially parallel loop
- General technique for parallel loops:
  - Remove anti and output dependences
  - Execute in parallel, using cross-iteration synchronization to enforce flow dependences

15. Partially Parallel Loops
- Sequential loops with a constant dependence distance can overlap segments of their iterations
- Statically analyzable doacross loops are exploited with post and await instructions (sketched below)
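
A doacross sketch assuming a single constant dependence distance D; post/await is modeled with one atomic flag per iteration (the done array, zero-initialized by the caller, is an assumption of this sketch):

    #include <stdatomic.h>

    #define D 4   /* constant dependence distance (illustrative) */

    /* done[i] is posted once iteration i completes. With a round-robin
       schedule every thread runs its iterations in increasing order, so
       the lowest unfinished iteration can always proceed and the spin
       below cannot deadlock. */
    void doacross(double *a, int n, atomic_int *done) {
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < n; i++) {
            if (i >= D)
                while (!atomic_load(&done[i - D]))
                    ;                                 /* await(i - D) */
            a[i] = (i >= D ? a[i - D] : 0.0) + 1.0;   /* flow dep at distance D */
            atomic_store(&done[i], 1);                /* post(i) */
        }
    }
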
16. Partially Parallel Loops
- Partition the iteration space into disjoint sets of mutually independent iterations (wavefronts)
- Execute the wavefronts as a chain of doalls separated by global synchronizations (sketched below)
- Alternatively, use independent threads of dependent iterations
- These methods require very detailed information about the data dependences in the loop
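
A sketch of the chain-of-doalls execution, assuming an inspector has already packed the iterations of each wavefront into wf_iter, with wf_start[w]..wf_start[w+1] delimiting wavefront w (illustrative names):

    /* Within a wavefront all iterations are mutually independent, so
       each wavefront runs as a doall; the implicit barrier at the end
       of "omp for" is the global synchronization between wavefronts. */
    void run_wavefronts(int nwf, const int *wf_start, const int *wf_iter,
                        void (*body)(int iteration)) {
        for (int w = 0; w < nwf; w++) {
            #pragma omp parallel for
            for (int k = wf_start[w]; k < wf_start[w + 1]; k++)
                body(wf_iter[k]);
        }
    }
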
17. Conclusions on These Techniques
- Some rely on statically defined pattern matching, which has limited recognition capabilities
- Others require information that is unavailable at compile time (the access pattern), as in irregular applications
- Moral: we should use run-time techniques
 
18. General Run-Time Techniques
- For parallelizing both doall loops and partially parallel loops
- Inspector/executor methods: extract an inspector loop that traverses the access pattern, without modifying the shared data, before the actual computation takes place (skeleton below)
- Speculative methods: inspect the shared-memory accesses during the parallel execution of the loop under test
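
A minimal inspector/executor sketch for a loop of the form a[idx[i]] = f(i): the inspector reads only the subscript array, never the shared data; here it simply checks for duplicate subscripts and picks between two loop versions (a simplification of the methods on the following slides):

    #include <stdlib.h>

    /* idx values must lie in [0, m). */
    void inspect_and_execute(double *a, int m, const int *idx, int n,
                             double (*f)(int)) {
        /* inspector: side-effect-free traversal of the access pattern */
        char *seen = calloc(m, 1);
        int fully_parallel = 1;
        for (int i = 0; i < n && fully_parallel; i++) {
            if (seen[idx[i]])
                fully_parallel = 0;    /* output dependence detected */
            seen[idx[i]] = 1;
        }
        free(seen);

        /* executor: two-version loop */
        if (fully_parallel) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[idx[i]] = f(i);
        } else {
            for (int i = 0; i < n; i++)
                a[idx[i]] = f(i);
        }
    }
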
19. RTT for Partially Parallel Loops
- Simple methods generate two-version loops
- Other methods:
- 1. Methods using critical sections
  - A) Add an iteration to the current wavefront if no data accessed in that iteration is accessed by any lower unassigned iteration
  - Find the lowest unassigned iteration that accesses each array element using atomic compare-and-swap operations and a shadow array (one round of this is sketched below)
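
A sketch of one round of this method, assuming a single shared-array access per iteration; shadow must be reset to INT_MAX before each round, and assigned marks iterations already placed in earlier wavefronts (names are illustrative):

    #include <stdatomic.h>
    #include <limits.h>

    /* Atomic "min": install iter in *slot if it is lower than the
       current value; a failed CAS reloads cur, so the loop retries. */
    static void cas_min(atomic_int *slot, int iter) {
        int cur = atomic_load(slot);
        while (iter < cur &&
               !atomic_compare_exchange_weak(slot, &cur, iter))
            ;
    }

    /* Unassigned iterations claim their element in the shadow array,
       then join the wavefront only if they are the lowest unassigned
       iteration touching that element. */
    void assign_wavefront(const int *idx, const int *assigned, int n,
                          atomic_int *shadow, int *in_wavefront) {
        #pragma omp parallel
        {
            #pragma omp for                 /* claim phase */
            for (int i = 0; i < n; i++)
                if (!assigned[i])
                    cas_min(&shadow[idx[i]], i);
            #pragma omp for                 /* test phase (after barrier) */
            for (int i = 0; i < n; i++)
                in_wavefront[i] = !assigned[i] &&
                                  atomic_load(&shadow[idx[i]]) == i;
        }
    }
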
20. RTT for Partially Parallel Loops
- Pros: small memory requirements
- Cons: may only behave well for small values of cpl and for access patterns without hot spots
- B) Parallel inspector:
  - Allocate global space to remove all anti and output dependences
  - Build a dependence graph, mapping all memory accesses to the array
  - Use synchronization to enforce flow dependences
 
21. RTT for Partially Parallel Loops
- C) Inspector with a private phase and a merging phase:
  - Private phase: chunk the loop among CPUs; each CPU builds a list of the accesses in its assigned iterations
  - Merging phase: the per-location lists are linked across processors by a global algorithm
  - The scheduler/executor then performs doacross parallelization
 
22. RTT for Partially Parallel Loops
- 2. Methods for loops without output dependences
- A) Most of them transform the original source loop into an inspector:
  - The inspector performs run-time data dependence analysis and constructs a schedule
  - The executor executes the schedule
  - The original source loop must not contain output dependences!
  - The inspector performs a sequential topological sort of the accesses in the loop (sketched below)
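
A sketch of such an inspector, assuming one read a[rd[i]] and one write a[wr[i]] per iteration; it conservatively treats every pair of accesses to the same element as a dependence, which can only over-serialize:

    /* Iteration i is scheduled one wavefront after the latest earlier
       iteration touching either of its elements. last_wf[] (size m,
       initialized to -1) remembers, per element, the highest wavefront
       that accessed it so far. */
    int build_wavefronts(const int *rd, const int *wr, int n,
                         int *last_wf, int *wf) {
        int nwf = 0;
        for (int i = 0; i < n; i++) {
            int prev = last_wf[rd[i]] > last_wf[wr[i]]
                     ? last_wf[rd[i]] : last_wf[wr[i]];
            wf[i] = prev + 1;
            last_wf[rd[i]] = last_wf[wr[i]] = wf[i];
            if (wf[i] + 1 > nwf)
                nwf = wf[i] + 1;
        }
        return nwf;   /* number of wavefronts = critical path length */
    }
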
23. RTT for Partially Parallel Loops
- Improvements come from parallelizing the topological sort:
  - I. Assign iterations to CPUs in a wrapped (round-robin) manner and use busy-waiting
  - II. Sectioning: requires synchronization (a barrier per CPU)
  - III. Bootstrapping: the inspector itself is parallelized, using the sectioning method

24. RTT for Partially Parallel Loops
- 3. A scalable method for loop parallelization:
  - Inspector phase
  - Scheduler: generates wavefronts
  - Executor: executes the wavefronts
- The scheduler and executor can work in tandem or in interleaved mode (or generate independent threads)

25. RTT for Fully Parallel (doall) Loops
- Doall loops have more scalability potential
- We describe the LRPD test: it detects fully parallel loops and can validate privatization and reduction parallelization
- It detects the presence of cross-iteration dependences but does not identify them
- It need only be applied to scalars and arrays that can't be analyzed at compile time

26. The LRPD Test
- An important source of ambiguity: the run-time equivalent of dead code
- The LRPD test checks only the dynamic data dependences caused by actual cross-iteration flow
- This is done via dynamic dead reference elimination

27. The Lazy (Value-based) Privatizing doall Test (LPD)
- 1. Marking phase:
  - Use shadow arrays Aw, Ar, and Anp (written, read, not privatizable)
  - Mark reads and writes accordingly; count the writes
- 2. Analysis phase:
  - Compute tw(A), the total number of writes, and tm(A), the total number of marked elements
  - If any(Aw ∧ Ar), the loop is not a doall
  - If tw(A) = tm(A), the loop is a doall
  - If any(Aw ∧ Anp), the loop is not a doall
  - Otherwise, the loop can be transformed into a doall by privatizing the shared array (a simplified sketch follows)
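
A much-simplified sketch of the test over a recorded access trace. The real test instruments the loop body instead of consuming a trace, and its marking rules are finer-grained: in this simplification Ar and Anp coincide, while the full LPD marks them lazily, only for reads whose value is actually used (next slide):

    #include <stdlib.h>

    enum { NOT_DOALL, DOALL, PRIV_DOALL };

    /* Access k touches element elem[k] of A in iteration iter[k], in
       original (sequential) execution order; is_wr[k] marks writes. */
    int lpd_test(int K, const int *iter, const int *elem,
                 const char *is_wr, int m) {
        char *Aw = calloc(m, 1), *Ar = calloc(m, 1), *Anp = calloc(m, 1);
        int  *wit = malloc(m * sizeof(int));  /* last iteration to write j */
        for (int j = 0; j < m; j++) wit[j] = -1;
        long tw = 0;

        for (int k = 0; k < K; k++) {         /* marking phase */
            int j = elem[k];
            if (is_wr[k]) {
                Aw[j] = 1; tw++; wit[j] = iter[k];
            } else if (wit[j] != iter[k]) {   /* read not covered by a */
                Ar[j] = Anp[j] = 1;           /* same-iteration write  */
            }
        }

        long tm = 0;                          /* analysis phase */
        int rw = 0, rnp = 0;
        for (int j = 0; j < m; j++) {
            tm  += Aw[j];
            rw  |= Aw[j] && Ar[j];
            rnp |= Aw[j] && Anp[j];
        }
        free(Aw); free(Ar); free(Anp); free(wit);

        if (rw)       return NOT_DOALL;   /* cross-iteration flow       */
        if (tw == tm) return DOALL;       /* every element written once */
        if (rnp)      return NOT_DOALL;   /* privatization cannot help  */
        return PRIV_DOALL;                /* doall after privatizing A  */
    }
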
28. LPD: Dynamic Dead Reference Elimination
- Postpone the marking of Ar and Anp until the value of the shared variable is actually used
- A dynamically dead read reference is a read access of a shared variable that both:
  - Does not contribute to the computation of any other shared variable, and
  - Does not control (predicate) the references to other shared variables

29. LRPD: Extending LPD for Reduction Validation
- Compile-time pattern matching is weak:
  - Data dependence analysis for reduction detection can't be done statically in the presence of input-dependent access patterns
  - Syntactic pattern matching can't identify all potential reduction variables (e.g. subscripted subscripts)

30. LRPD: Extending LPD for Reduction Validation
- Use an extra shadow array, Anx, to flag invalid reduction variables
- A variable is not a valid reduction variable if it was either defined (written) or used (read) outside the reduction statement

31. Complexity and Implementation Issues
- Private shadow structures
- A processor-wise version of the LRPD test
- Complexity:
  - Time: linear in the total iteration count of the loop and in the number of elements of the shared array
  - Space: linear in the number of elements of the shared array

32. Inspector/Executor vs. Speculative
- Inspector/executor needs a proper inspector loop to be extracted
- For many applications that is not possible
- Solution: speculate
- But speculation must generate additional code for saving and restoring state (skeleton below)
- Both methods imply two-version loops: the serial version is executed if the loop turns out not to be a doall
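
A skeleton of the speculative two-version scheme; parallel_and_test stands for the generated speculative loop plus the run-time test's verdict, and sequential for the original serial loop (both names are assumptions of this sketch):

    #include <stdlib.h>
    #include <string.h>

    /* parallel_and_test runs the speculative parallel loop while the
       shadow arrays are marked, and returns nonzero iff the test
       passed; on failure, state is restored and the loop re-executes
       serially. */
    void speculate(double *a, int m,
                   int  (*parallel_and_test)(double *a, int m),
                   void (*sequential)(double *a, int m)) {
        double *backup = malloc(m * sizeof *a);
        memcpy(backup, a, m * sizeof *a);        /* save state */

        if (!parallel_and_test(a, m)) {          /* speculation failed */
            memcpy(a, backup, m * sizeof *a);    /* restore state */
            sequential(a, m);                    /* re-execute serially */
        }
        free(backup);
    }
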
33. Experimental Results [figure]

34. Experimental Results [figure]

35. Framework for Implementing Run-time Testing
- At compile time:
  - A) Perform cost/benefit analysis to determine how the loop should be treated
  - B) Decide on loop transformations based on A)
  - C) Generate code accordingly
 
36. Framework for Implementing Run-time Testing
- At run time:
  - A) Checkpoint if necessary
  - B) Execute the run-time technique in parallel
  - C) Collect statistics
 
37. Related Work
- 1. Race detection:
  - Race conditions and access anomalies can be detected statically, from an execution trace, or at run time
  - Expensive in terms of:
    - Memory requirements
    - Execution-time overhead
- 2. Optimistic execution and the virtual time concept:
  - Applicable to databases and discrete event simulation rather than loop parallelization

38. Future Directions
- There is a need for new techniques
- Make more aggressive use of run-time optimizations
- We need more experimental, statistical data about the parallel profiles of dynamic codes
- Architectural support will help reduce the run-time overhead