Speculative Parallelization of Partially Parallel Loops
1
Speculative Parallelization of Partially Parallel Loops
  • Francis Dang and Dr. Lawrence Rauchwerger
  • Department of Computer Science,
    Texas A&M University
  • http://www.cs.tamu.edu/people/fhd4244
  • http://www.cs.tamu.edu/faculty/rwerger

2
Motivation
  • Static compiler methods cannot always extract all
    the parallelism in loops because:
    • Access patterns are too complex.
    • Required information is not available at
      compile-time.
  • Run-time methods can be used to parallelize more
    loops.

3
Partially Parallel Loops
  • Loops can be:
    • Fully parallel (doall)
    • Fully sequential
    • Partially parallel
  • Partially parallel loops:
    • Not all iterations can be executed independently.
    • May still have enough parallelism to exploit.

4
Partially Parallel Loops Example
      do i = 1, 8
        z = A(K(i))
        A(L(i)) = z + C(i)
      enddo

  K(1:8) = (1, 2, 3, 1, 4, 2, 1, 1)
  L(1:8) = (4, 5, 5, 4, 3, 5, 3, 3)
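The dependence structure of this loop can be checked mechanically. The following sketch (mine, not from the talk) finds the first iteration involved in a flow (read-after-write) dependence, assuming anti and output dependences are removed by privatization as the implementation slide later describes:

```python
# Index arrays from the slide (1-based iterations).
K = [1, 2, 3, 1, 4, 2, 1, 1]   # A(K(i)) is read
L = [4, 5, 5, 4, 3, 5, 3, 3]   # A(L(i)) is written

def first_flow_dependence(K, L):
    """Return the 1-based number of the first iteration that reads an
    element of A written by an earlier iteration, or None if the loop
    has no cross-iteration flow dependence."""
    written = set()
    for i, (r, w) in enumerate(zip(K, L), start=1):
        if r in written:      # read-after-write across iterations
            return i
        written.add(w)
    return None

print(first_flow_dependence(K, L))   # iteration 5 reads A(4), written by iteration 1
```

Here iterations 1 through 4 are independent, so the loop is partially parallel: it is neither a doall nor fully sequential.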

5
Related Work
  • Inspector/Executor (Saltz, Zhu and Yew,
    Rauchwerger, Mellor-Crummey, et al.)
    • Advantage: works well for partially parallel
      loops.
    • Disadvantages:
      • A proper side-effect-free inspector is not
        always available.
      • Can require large additional data structures.
  • LRPD Test (Rauchwerger)
    • Advantage: works well for fully parallel loops.
    • Disadvantage: slowdown is proportional to the
      speculative parallel execution time.

6
Recursive LRPD
  • Main idea:
    • Transform a partially parallel loop into a
      sequence of fully parallel loops.
    • Iterations before the first data dependence are
      correct and committed.
    • Reapply the LRPD test on the remaining
      iterations.
    • Blocked scheduling.
  • Worst case:
    • Sequential time plus testing overhead.
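The staging above can be sketched as follows (a simplified model of my own, not the authors' code): each stage runs the remaining iterations as a doall under the LRPD test, commits the dependence-free prefix, and reapplies the test to the rest. `first_failure` stands in for the run-time test and detects only flow dependences within a block, assuming anti and output dependences are handled by privatization:

```python
# Index arrays from the example slide (1-based iterations).
K = {i: k for i, k in enumerate([1, 2, 3, 1, 4, 2, 1, 1], start=1)}
L = {i: l for i, l in enumerate([4, 5, 5, 4, 3, 5, 3, 3], start=1)}

def first_failure(block):
    """Position of the first iteration in `block` that reads an element
    written by an earlier iteration of the same block, else None."""
    written = set()
    for pos, i in enumerate(block):
        if K[i] in written:
            return pos
        written.add(L[i])
    return None

def recursive_lrpd(iters):
    """Split the iteration space into a sequence of stages, each of
    which ran as a fully parallel loop before being committed."""
    stages = []
    while iters:
        p = first_failure(iters)
        if p is None:             # whole block passed the test
            p = len(iters)
        elif p == 0:              # defensive: commit one iteration
            p = 1
        stages.append(iters[:p])  # prefix is correct: commit it
        iters = iters[p:]         # reapply the test to the rest
    return stages

print(recursive_lrpd(list(range(1, 9))))   # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

For the example loop, two fully parallel stages suffice; in the worst case every stage commits a single iteration, giving sequential time plus testing overhead.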

7
Recursive LRPD Algorithm
8
Recursive LRPD Implementation
  • Implemented with our run-time pass in Polaris and
    with hand-inserted code.
  • Privatization for arrays under test.
  • Replicated buffers for reductions.
  • Checkpoint shared arrays.
  • Record memory references during execution.
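One of these mechanisms, replicated buffers for reductions, can be sketched as follows (a minimal model with invented names, not the Polaris-generated code): each thread accumulates into a private copy of the reduction array, and the private copies are merged after the parallel phase, so the speculative doall performs no conflicting writes.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum_reduction(n_elems, contributions, n_threads=4):
    """contributions: list of (element, value) pairs produced by loop
    iterations. Each thread accumulates into a private buffer; the
    buffers are merged into the shared array after the parallel phase."""
    buffers = [[0.0] * n_elems for _ in range(n_threads)]
    chunks = [contributions[t::n_threads] for t in range(n_threads)]

    def work(t):
        for e, v in chunks[t]:
            buffers[t][e] += v   # race-free: thread-private buffer

    with ThreadPoolExecutor(n_threads) as ex:
        list(ex.map(work, range(n_threads)))

    # Merge phase: combine the replicated buffers element-wise.
    return [sum(buffers[t][e] for t in range(n_threads))
            for e in range(n_elems)]
```

The merge costs one pass over the array per thread, which is the usual price of replicated reduction buffers.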

9
Recursive LRPD Example
      do i = 1, 8
        z = A(K(i))
        A(L(i)) = z + C(i)
      enddo

  K(1:8) = (1, 2, 3, 1, 4, 2, 1, 1)
  L(1:8) = (4, 5, 5, 4, 2, 5, 3, 3)
10
Work Redistribution
  • Redistribute remaining iterations across all
    processors.
  • Advantage:
    • Execution time for each stage will decrease.
  • Disadvantages:
    • May uncover new dependences across processors.
    • May incur remote cache misses from data
      redistribution.
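Under blocked scheduling, redistribution amounts to re-splitting the uncommitted iterations into contiguous chunks, one per processor. A sketch (the function name is mine):

```python
def block_partition(iters, n_procs):
    """Blocked scheduling: split the remaining iterations into n_procs
    contiguous chunks of near-equal size (earlier chunks absorb the
    extra iterations when the count does not divide evenly)."""
    base, extra = divmod(len(iters), n_procs)
    chunks, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        chunks.append(iters[start:start + size])
        start += size
    return chunks

# After committing iterations 1-4 of the example, redistribution gives
# every processor part of the remaining work:
print(block_partition([5, 6, 7, 8], 4))   # [[5], [6], [7], [8]]
```

Without redistribution, the processors that owned the committed iterations would sit idle in the next stage; with it, each stage shrinks, at the cost of moving data between caches.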

11
Work Redistribution Illustration
With Redistribution
Without Redistribution
12
Work Redistribution Example
      do i = 1, 8
        z = A(K(i))
        A(L(i)) = z + C(i)
      enddo

  K(1:8) = (1, 2, 3, 1, 4, 2, 1, 1)
  L(1:8) = (4, 5, 5, 4, 2, 5, 3, 3)
13
Redistribution Model
  • Redistribution may not always be beneficial.
  • Stop redistributing when the cost of moving the
    data outweighs the benefit of the rebalanced
    schedule.
  • Used a synthetic loop to model this adaptive
    method.
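The stopping rule might take a form like the following (a hypothetical model of my own; the talk calibrates the trade-off with a synthetic loop, and its actual cost terms are not reproduced here):

```python
def should_redistribute(remaining_time_est, n_procs, move_cost_est):
    """Hypothetical adaptive decision: spreading the remaining work
    over n_procs instead of leaving it on one processor saves roughly
    remaining_time_est * (1 - 1/n_procs); redistribute only when that
    saving exceeds the estimated cost of moving the data."""
    saving = remaining_time_est * (1.0 - 1.0 / n_procs)
    return saving > move_cost_est

print(should_redistribute(100.0, 4, 50.0))   # True: saving 75 > cost 50
print(should_redistribute(10.0, 4, 50.0))    # False: saving 7.5 < cost 50
```

As the remaining iteration count shrinks across stages, the saving term shrinks with it, so late stages naturally stop redistributing.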

14
Redistribution Model
15
Experiments
  • Setup:
    • 16-processor HP V-Class
    • 4 GB memory
    • HP-UX 11.0
  • Codes and loops (table on slide)

16
Input Profile
17
Experimental Results
18
Experimental Results
19
Experimental Results
20
Experimental Results
21
Issues
  • May have load imbalance due to blocked
    scheduling.
  • Checkpointing can be expensive.
  • Work redistribution:
    • May uncover more dependences.
    • May cause remote cache misses from data
      redistribution.

22
Feedback-Guided Block Scheduling
  • Use the timing information from the previous
    instantiation (Bull, EuroPar '98).
  • Estimate per-processor chunk sizes that minimize
    load imbalance.
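The chunk-size estimate can be sketched as follows (my own simplification of Bull's scheme, with invented names): each processor receives a chunk inversely proportional to its per-iteration time observed in the previous instantiation, so all chunks should finish at roughly the same time.

```python
def feedback_guided_chunks(prev_iter_times, total_iters):
    """prev_iter_times: per-processor time per iteration measured in
    the previous instantiation of the loop.  Returns chunk sizes
    proportional to each processor's observed speed."""
    speeds = [1.0 / t for t in prev_iter_times]   # iterations per unit time
    total_speed = sum(speeds)
    chunks = [round(total_iters * s / total_speed) for s in speeds]
    chunks[-1] += total_iters - sum(chunks)       # absorb rounding error
    return chunks

# A processor that was 3x slower per iteration gets a 3x smaller chunk:
print(feedback_guided_chunks([1.0, 1.0, 1.0, 3.0], 100))   # [30, 30, 30, 10]
```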

23
Conclusion
  • Contributions:
    • Any loop can be speculatively parallelized.
    • The concern is no longer whether to parallelize,
      but how to optimize the parallelization.
  • Future work:
    • Use feedback-guided block scheduling to minimize
      load imbalance.
    • Decrease run-time overhead.
    • Use dependence distribution information for
      adaptive redistribution and scheduling.