1
Run-time Parallelization: Its Time Has Come
  • Lawrence Rauchwerger
  • Presented by Burcea Mihai

2
Abstract
  • Parallel programs gain performance through loop
    parallelization
  • Insufficiently defined access patterns reduce the
    possibility of exploiting those benefits to the full
  • Complement static analysis with techniques based on
    run-time data dependence analysis

3
Agenda
  • Loop parallelization: description, techniques,
    issues
  • Approaches to run-time parallelization
  • Run-time techniques for partially parallel loops
  • Run-time techniques for fully parallel loops
  • Framework for implementing runtime testing

4
Automatic Parallelization Issues
  • Three ways to achieve parallel performance:
  • Good parallel algorithms
  • Standard parallel language for portable parallel
    programming
  • Compilers to parallelize sequential programs and
    to optimize parallel programs

5
Automatic Parallelization Issues (cont'd)
  • Options 1 and 2 are difficult to achieve. Why?
  • Best option: a parallelizing compiler that optimizes
    both legacy and modern code
  • It should do:
  • Parallelism detection
  • Parallelism exploitation

6
Run-time Parallelization
  • First task: data dependence analysis. Its goal?
  • Two classes of applications:
  • Regular programs (static memory access patterns)
  • Irregular programs (dynamic memory access patterns)

7
Runtime Parallelization: Why?
  • 50% of applications are irregular
  • For irregular applications and complex (even
    statically) access patterns, parallelizing
    compilers don't do a good job
  • Solution: optimize at run time, when more information
    is available

8
Runtime Parallelization
  • Examples:
  • Input-dependent / dynamic data distributions
  • Memory accesses guarded by run-time guard
    conditions
  • Subscript expressions
  • We discuss techniques for optimizing loop
    parallelism in Fortran programs on shared-memory
    architectures

9
Loop Parallelization - Concepts
  • Doall loops
  • Partially parallel loops: some synchronization
    needed
  • Doacross (pipelined) loops

10
Data Dependency Types
  • Flow (read after write) (e.g. producer-consumer)
  • Anti (write after read)
  • Output (write after write)
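
  A minimal, hypothetical C sketch of the three dependence
  types (my illustration; the deck's own examples were on a
  chart that did not survive the transcript):

    /* Hypothetical loops exhibiting each dependence type. */
    void dependence_examples(int n, int a[], int b[], int d[]) {
        int t = 0;
        for (int i = 1; i < n; i++)
            a[i] = a[i-1] + 1;   /* flow (RAW): i reads what i-1 wrote */
        for (int i = 0; i + 1 < n; i++)
            b[i] = b[i+1] * 2;   /* anti (WAR): i+1 overwrites what i read */
        for (int i = 0; i < n; i++) {
            t = a[i] * 2;        /* output (WAW): every iteration writes t */
            d[i] = t + t;
        }
    }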

11
Dependency Types: Examples
  • Anti and output dependencies should be removed
    before the loop can be executed in parallel

12
Parallelism Enabling Transformations
  • Privatization: create private copies on each
    processor
  • Reduction: variables and associative operations
    (see the previous example c))
  • Reduction implies:
  • Recognizing the reduction variable, and
  • Parallelizing the reduction operation
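
  A minimal OpenMP C sketch of both transformations (my
  illustration; the array names are assumptions):

    #include <omp.h>

    double privatize_and_reduce(int n, const double a[], double d[]) {
        double t, sum = 0.0;
        /* privatization: each thread gets its own copy of t, removing
           the anti/output dependences on the shared temporary */
        #pragma omp parallel for private(t)
        for (int i = 0; i < n; i++) {
            t = a[i] * 2.0;
            d[i] = t + t;
        }
        /* reduction: '+' is associative, so per-thread partial sums
           can be combined after the loop */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }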

13
Reduction Techniques
  • For commutative operations: transform the do loop
    into a doall and enclose the access in a critical
    region
  • Cons: not always scalable; requires
    synchronization, which can be expensive
  • Scalable solution: privatization plus an
    interprocessor reduction phase
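
  A hedged sketch of both variants in OpenMP C, assuming an
  irregular access pattern through a subscript array idx[]:

    #include <stdlib.h>

    /* Variant 1: doall plus a critical region around each update.
       Correct for commutative updates, but serializes on hot spots. */
    void reduce_critical(int n, const int idx[], const double a[],
                         double hist[]) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            hist[idx[i]] += a[i];
        }
    }

    /* Variant 2: privatize hist[], then an interprocessor
       reduction phase. */
    void reduce_scalable(int n, int m, const int idx[], const double a[],
                         double hist[]) {
        #pragma omp parallel
        {
            double *local = calloc(m, sizeof *local);  /* private copy */
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                local[idx[i]] += a[i];                 /* no synchronization */
            #pragma omp critical                       /* merge phase */
            for (int j = 0; j < m; j++)
                hist[j] += local[j];
            free(local);
        }
    }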

14
Partially Parallel Loops
  • Use DAGs; cpl = critical path length
  • If cpl = number of iterations, the loop is
    sequential
  • Otherwise it is a partially parallel loop
  • General technique for parallel loops:
  • Remove anti and output dependences
  • Execute in parallel by using cross-iteration
    synchronizations to enforce flow dependences

15
Partially Parallel Loops
  • Sequential loops with a constant dependence
    distance can overlap segments of their iterations
  • Statically analyzable doacross loops are exploited
    by using post and await instructions (sketched below)
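
  post/await are not standard C; as a rough modern analogue, a
  doacross loop can be written with OpenMP 4.5 ordered
  dependences (heavy_work is a hypothetical stand-in for the
  iteration's independent part):

    extern double heavy_work(int i);   /* hypothetical independent work */

    void doacross(int n, double a[]) {
        #pragma omp parallel for ordered(1)
        for (int i = 1; i < n; i++) {
            double x = heavy_work(i);              /* overlapped segment */
            #pragma omp ordered depend(sink: i-1)  /* "await" iteration i-1 */
            a[i] = a[i-1] + x;                     /* flow-dependent segment */
            #pragma omp ordered depend(source)     /* "post" iteration i done */
        }
    }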

16
Partially Parallel Loops
  • Have their iteration space partitioned into
    disjoint sets of mutually independent iterations
    (wavefronts)
  • Execute wavefronts as a chain of doalls separated
    by global synchronization (see the sketch after
    this list)
  • Or use independent threads of dependent
    iterations
  • These methods require very detailed info about
    data dependence in the loop
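
  A sketch of the chain-of-doalls execution, with placeholder
  names (wf, wf_size, do_iteration) standing in for the
  scheduler's output:

    extern void do_iteration(int i);   /* hypothetical original loop body */

    void run_wavefronts(int num_wf, const int wf_size[], int *const wf[]) {
        for (int w = 0; w < num_wf; w++) {
            /* iterations within one wavefront are mutually independent */
            #pragma omp parallel for
            for (int k = 0; k < wf_size[w]; k++)
                do_iteration(wf[w][k]);
            /* implicit barrier here = the global synchronization */
        }
    }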

17
Conclusion To These Techniques
  • Some are implemented through statically defined
    pattern matching: reduced recognition capabilities
  • Others require information that is not available
    at compile time (the access pattern; see irregular
    applications)
  • Moral: we should use run-time techniques

18
General Run-Time Techniques
  • For parallelization of both doall loops and
    partially parallel loops
  • Inspector/executor methods: extract an inspector
    loop that traverses the access pattern, without
    modifying the shared data, before the actual
    computation takes place (see the sketch below)
  • Speculative methods: inspect the shared memory
    accesses during the parallel execution of the loop
    under test
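
  A minimal inspector sketch for a loop that updates a[idx[i]]
  (names assumed); note that it traverses only the subscript
  array and never touches the shared data:

    #include <stdlib.h>

    /* For each iteration i, record the most recent earlier iteration
       that wrote the same element; pred[] encodes the flow dependences
       the executor must later enforce. */
    void inspector(int n, int m, const int idx[], int pred[]) {
        int *last_writer = malloc(m * sizeof *last_writer);
        for (int j = 0; j < m; j++) last_writer[j] = -1;
        for (int i = 0; i < n; i++) {
            pred[i] = last_writer[idx[i]];   /* -1: no predecessor */
            last_writer[idx[i]] = i;
        }
        free(last_writer);
    }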

19
RTT for Partially Parallel Loops
  • Simple methods generate two-version loops
  • Other methods:
  • 1. Methods using critical sections
  • A) Add an iteration to the current wavefront if
    no data accessed in that iteration is accessed by
    any lower unassigned iteration
  • Find the lowest unassigned iteration that
    accesses each array element using atomic
    compare-and-swap directives and a shadow array
    (sketched below)
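
  A hedged C11 sketch of the compare-and-swap step (the shadow
  array, the idx[] pattern, and the one-access-per-iteration
  shape are simplifying assumptions):

    #include <stdatomic.h>

    /* shadow[e] holds the lowest unassigned iteration claiming
       element e; entries start at INT_MAX. */
    void claim(atomic_int shadow[], const int idx[], int i) {
        int e = idx[i];
        int cur = atomic_load(&shadow[e]);
        while (i < cur &&
               !atomic_compare_exchange_weak(&shadow[e], &cur, i))
            ;   /* a failed CAS refreshes cur; retry while i is lower */
    }

    /* After all unassigned iterations have claimed, iteration i joins
       the current wavefront iff shadow[idx[i]] == i, i.e. no lower
       unassigned iteration accesses the same data. */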

20
RTT for Partially Parallel Loops
  • Pros: small memory requirements
  • Cons: may only behave well for small values of
    cpl and for access patterns without hot spots
  • B) Parallel inspector
  • Allocate global space to remove all anti and
    output dependences
  • Build a dependence graph and map all memory
    accesses to our array
  • Use synchronization to enforce flow dependences

21
RTT for Partially Parallel Loops
  • C) Inspector with a private phase and a merging
    phase
  • Private phase: chunk the loop among CPUs; each
    CPU builds a list of the accesses in its
    assigned iterations
  • Merging phase: the per-location lists are linked
    across processors via a global algorithm
  • The scheduler/executor then performs doacross
    parallelization

22
RTT for Partially Parallel Loops
  • 2. Methods for loops without output dependences
  • A) Most of them transform the original source loop
    into an inspector
  • Inspector does run-time data dependence analysis
    and constructs a schedule
  • Executor executes the schedule
  • The original source loop must not have output
    dependences!
  • Inspector does a sequential topological sort of
    the accesses in the loop

23
RTT for Partially Parallel Loops
  • Improvements: parallelize the topological sort
  • I. Assign iterations to CPUs in a wrapped manner;
    use busy-waiting
  • II. Sectioning: requires synchronization (a barrier
    for each CPU)
  • III. Bootstrapping: the inspector itself is
    parallelized, using the sectioning method

24
RTT for Partially Parallel Loops
  • 3. A scalable method for loop parallelization
  • Inspector phase:
  • Scheduler generates wavefronts
  • Executor executes the wavefronts
  • Scheduler and executor can work in tandem or in
    interleaved mode (or generate independent
    threads)

25
RTT for Fully Parallel (doall) Loops
  • Doall loops: more scalability potential
  • We describe the LRPD test: it detects fully
    parallel loops and can validate privatization and
    reduction parallelization
  • It detects the presence of cross-iteration
    dependencies but does not identify them
  • It need only be applied to scalars and arrays that
    can't be analyzed at compile time

26
The LRPD Test
  • An important source of ambiguity: the run-time
    equivalent of dead code
  • The LRPD test checks only the dynamic data
    dependences caused by the actual cross-iteration
    flow of values
  • This is done via dynamic dead reference
    elimination

27
The Lazy (Value-based) Privatizing doall Test
(LPD)
  • 1. Marking phase
  • Use shadow arrays Ar (read), Aw (write), and Anp
    (not privatizable)
  • Mark reads and writes accordingly; count the writes
  • 2. Analysis Phase
  • Compute tw(A) (total writes) and tm(A) (total
    marked elements in Aw)
  • If any(Aw ∧ Ar), the loop is not a doall
  • If tw(A) = tm(A), the loop is a doall
  • If any(Aw ∧ Anp), the loop is not a doall
  • Otherwise, the loop can be transformed into a
    doall by privatizing the shared array
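
  A simplified sequential sketch of the two phases for one
  shared array A of size M (the helper names are mine; the
  real test uses private, processor-wise shadow structures,
  as slide 31 notes):

    enum { M = 1024 };            /* assumed size of the shared array A */
    int Ar[M], Aw[M], Anp[M];     /* shadow arrays, zero-initialized */
    int w_iter[M];                /* was A[k] written this iteration?
                                     (reset between iterations, not shown) */
    long tw = 0;                  /* tw(A): total number of writes */

    void mark_write(int k) { Aw[k] = 1; w_iter[k] = 1; tw++; }

    void mark_read(int k) {       /* called lazily, when the value is used */
        if (!w_iter[k]) {         /* exposed read: no earlier write in
                                     the same iteration */
            Ar[k] = 1;
            Anp[k] = 1;           /* read before write: not privatizable */
        }
    }

    int is_doall(void) {          /* analysis phase, as on this slide */
        long tm = 0;              /* tm(A): marked elements in Aw */
        for (int k = 0; k < M; k++) {
            tm += Aw[k];
            if (Aw[k] && Ar[k]) return 0;    /* any(Aw ∧ Ar): not a doall */
        }
        if (tw == tm) return 1;              /* each element written once */
        for (int k = 0; k < M; k++)
            if (Aw[k] && Anp[k]) return 0;   /* any(Aw ∧ Anp): not a doall */
        return 1;                            /* doall after privatizing A */
    }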

28
LPD: Dynamic Dead Reference Elimination
  • Postpone marking of Ar and Anp until the value of
    the shared variable is actually used
  • A dynamic dead read reference is a read access of a
    shared variable that both:
  • Does not contribute to the computation of any
    other shared variable, and
  • Does not control (predicate) the references to
    other shared variables

29
LRPD: Extending LPD for Reduction Validation
  • Compile-time pattern matching is weak
  • Data dependence analysis for reduction detection
    can't be done statically in the presence of
    input-dependent access patterns
  • Syntactic pattern matching can't identify all
    potential reduction variables (e.g. subscripted
    subscripts)

30
LRPD: Extending LPD for Reduction Validation
  • Use an extra shadow array Anx to flag invalid
    reduction variables
  • A variable is not a valid reduction variable if it
    was either defined (written) or used (read) outside
    the reduction statement
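
  Continuing the sketch above, the Anx bookkeeping could look
  like this (again my naming):

    int Anx[M];   /* zero-initialized, alongside Ar/Aw/Anp */

    /* Called for any access to A[k] that is not part of the
       recognized reduction statement (a definition or a use
       outside it). */
    void mark_outside_reduction(int k) { Anx[k] = 1; }

    /* At analysis time, element k passes reduction validation
       iff Anx[k] == 0. */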

31
Complexity and Implementation Issues
  • Private shadow structures
  • Processor-wise version of the LRPD test
  • Complexity:
  • Time: linear in the total iteration count of the
    loop and in the number of elements in the shared
    array
  • Space: linear in the number of elements in the
    shared array

32
Inspector/Executor vs. Speculative
  • Inspector/executor needs a proper inspector loop
    to be extracted
  • For many applications that is not possible
  • Solution: speculation
  • But we must generate additional code for saving and
    restoring state
  • Both methods imply two-version loops: the serial
    version will be executed if the loop is not a doall
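
  Schematically, with all helper names hypothetical:

    extern void checkpoint_state(void), restore_state(void);
    extern void run_loop_speculatively(void), run_loop_serially(void);
    extern int  is_doall(void);      /* e.g. the LRPD analysis phase */

    void two_version_loop(void) {
        checkpoint_state();          /* save state the loop may modify */
        run_loop_speculatively();    /* parallel execution with marking */
        if (!is_doall()) {           /* run-time test failed */
            restore_state();         /* undo the speculative updates */
            run_loop_serially();     /* fall back to the serial version */
        }
    }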

33
Experimental Results
34
Experimental Results
35
Framework for Implementing Run-time Testing
  • At compile time:
  • A) Cost/benefit analysis to determine how the
    loop should be treated
  • B) Decide upon any loop transformations, based on A)
  • C) Generate code accordingly

36
Framework for Implementing Run-time Testing
  • At run time:
  • A) Checkpoint if necessary
  • B) Execute the run-time technique in parallel
  • C) Collect statistics

37
Related Work
  • 1. Race detection
  • Detection of race conditions and access anomalies:
    statically, based on an execution trace, or at
    run time
  • Expensive in terms of:
  • Memory requirements
  • Execution time overhead
  • 2. Optimistic execution: the virtual time
    concept
  • Applicable to databases and discrete event
    simulation rather than loop parallelization

38
Future Directions
  • There is a need for new techniques
  • Make more aggressive use of run-time optimizations
  • More experimental, statistical data about the
    parallel profiles of dynamic codes is needed
  • Architectural support will help reduce the
    run-time overhead