Title: The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops
1. The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops
- Francis Dang, Hao Yu, and Lawrence Rauchwerger
- Department of Computer Science
- Texas A&M University
2. Motivation
- To maximize performance, extract the maximum available parallelism from loops.
- Static compiler methods may be insufficient.
  - Access patterns may be too complex.
  - Required information is only available at runtime.
- Run-time methods are needed to extract loop parallelism.
  - Inspector/Executor
  - Speculative Parallelization
3. Speculative Parallelization: LRPD Test
- Main Idea
  - Execute a loop as a DOALL.
  - Record memory references during execution.
  - Check for data dependences.
  - If there was a dependence, re-execute the loop sequentially.
- Disadvantages
  - One data dependence can invalidate speculative parallelization.
  - Slowdown is proportional to speculative parallel execution time.
  - Partial parallelism is not exploited.
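The check step above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the per-iteration `reads` and `writes` sets stand in for the shadow marking done during speculative execution, the function name `lrpd_test` is hypothetical, written arrays are assumed to be privatized (so output dependences are ignored), and reduction validation is left out.

```python
def lrpd_test(reads, writes):
    """Return True if the speculative DOALL execution was valid.

    reads[i]  -- elements iteration i read before writing them itself
    writes[i] -- elements iteration i wrote
    """
    for i, exposed in enumerate(reads):
        for j, written in enumerate(writes):
            # An element read by one iteration and written by a different
            # one is treated as a cross-iteration dependence.
            if i != j and exposed & written:
                return False
    return True


# Access pattern of a loop where iteration i reads A(K(i)) and writes A(L(i)).
K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 3, 5, 3, 3]
reads = [{k} for k in K]
writes = [{l} for l in L]

passed = lrpd_test(reads, writes)
print("DOALL valid" if passed else "dependence found: re-execute sequentially")
```

With this access pattern the test fails, so the whole loop would be re-executed sequentially even though most iterations are independent. That is the disadvantage listed above.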
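4. Partially Parallel Loop Example
    do i = 1, 8
      z = A(K(i))
      A(L(i)) = z + C(i)
    end do
    K(1:8) = (1,2,3,1,4,2,1,1)
    L(1:8) = (4,5,5,4,3,5,3,3)
    (R = read, W = write, . = no access)
    iter   1  2  3  4  5  6  7  8
    A(1)   R  .  .  R  .  .  R  R
    A(2)   .  R  .  .  .  R  .  .
    A(3)   .  .  R  .  W  .  W  W
    A(4)   W  .  .  W  R  .  .  .
    A(5)   .  W  W  .  .  W  .  .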
5. The Recursive LRPD
- Main Idea
  - Transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops.
  - Iterations before the first data dependence are correct and committed.
  - Re-apply the LRPD test on the remaining iterations.
- Worst case
  - Sequential time plus testing overhead
6. Algorithm
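The recursion can be sketched as a sequential simulation. Several things here are assumptions rather than the authors' code: the loop body is taken to be A(L(i)) = A(K(i)) + C(i), each "processor" is simulated by one pass over its block, blocks keep their original processor assignment between stages, and the names `r_lrpd` and `run_sequential` are hypothetical.

```python
def run_sequential(A, K, L, C):
    """Reference result: the loop executed in order."""
    A = list(A)
    for i in range(len(K)):
        A[L[i]] = A[K[i]] + C[i]
    return A


def r_lrpd(A, K, L, C, p):
    """Simulate recursive speculation on p processors with fixed blocks."""
    A = list(A)
    n = len(K)
    size = max(1, -(-n // p))                      # ceil(n / p)
    blocks = [(s, min(s + size, n)) for s in range(0, n, size)]
    first, stages = 0, []
    while first < len(blocks):
        # Speculative stage: every remaining block runs on private storage.
        results = []
        for lo, hi in blocks[first:]:
            private, exposed = {}, set()
            for i in range(lo, hi):
                k = K[i]
                if k in private:
                    z = private[k]                 # value produced in this block
                else:
                    exposed.add(k)                 # copy-in from shared array
                    z = A[k]
                private[L[i]] = z + C[i]
            results.append((exposed, private))
        # Analysis: a block fails if it read data written by a lower block.
        written_below, good = set(), 0
        for exposed, private in results:
            if exposed & written_below:
                break
            written_below.update(private)
            good += 1
        # Commit the correct prefix in block order (last value wins).
        for _, private in results[:good]:
            for idx, val in private.items():
                A[idx] = val
        stages.append((blocks[first][0], blocks[first + good - 1][1]))
        first += good                              # re-apply on the rest
    return A, stages


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
C = [1, 2, 3, 4, 5, 6, 7, 8]
A0 = [0, 10, 20, 30, 40, 50]                       # index 0 unused (1-based data)

result, stages = r_lrpd(A0, K, L, C, p=4)
print(stages)
```

The lowest-numbered block of every stage cannot depend on an uncommitted block, so at least one block commits per stage and the recursion terminates. In the worst case one block commits per stage, which is sequential execution plus the test overhead.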
7. Implementation
- Implemented as a run-time pass in Polaris, with additional hand-inserted code.
- Privatization with copy-in/copy-out for arrays under test.
- Replicated buffers for reductions.
- Backup arrays for checkpointing.
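Checkpointing with a backup can be sketched as a wrapper that remembers the old value of an element the first time it is overwritten. This is a hypothetical illustration, not the Polaris-generated code: the class name `CheckpointedArray` and its methods are assumptions. It saves only modified elements rather than copying the full array.

```python
class CheckpointedArray:
    """Array wrapper that can undo speculative writes."""

    def __init__(self, data):
        self.data = list(data)
        self._backup = {}          # index -> value before the first write

    def __getitem__(self, i):
        return self.data[i]

    def __setitem__(self, i, value):
        # Save the original value only on the first write to this element.
        self._backup.setdefault(i, self.data[i])
        self.data[i] = value

    def commit(self):
        """Speculation succeeded: discard the saved values."""
        self._backup.clear()

    def rollback(self):
        """Speculation failed: restore every modified element."""
        for i, old in self._backup.items():
            self.data[i] = old
        self._backup.clear()


B = CheckpointedArray([1, 2, 3, 4])
B[1] = 20
B[1] = 200                         # second write must not overwrite the backup
B[3] = 40
B.rollback()
print(B.data)
```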
8. Recursive LRPD Example
    do i = 1, 8
      z = A(K(i))
      A(L(i)) = z + C(i)
    end do
    K(1:8) = (1,2,3,1,4,2,1,1)
    L(1:8) = (4,5,5,4,2,5,3,3)
9. Heuristics
- Work Redistribution
- Sliding Window Approach
- Data Dependence Graph Extraction
10. Work Redistribution
- Redistribute remaining iterations across processors.
- Execution time for each stage will decrease.
- Disadvantages
  - May uncover new dependences across processors.
  - May incur remote cache misses from data redistribution.
11. Work Redistribution Example
    do i = 1, 8
      z = A(K(i))
      A(L(i)) = z + C(i)
    end do
    K(1:8) = (1,2,3,1,4,2,1,1)
    L(1:8) = (4,5,5,4,2,5,3,3)
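The effect of redistribution on this access pattern can be explored with a schedule-only simulation. It is a sketch under assumptions: it tracks only which elements each block reads and writes (no values), it assumes 4 processors, and the names `block_accesses` and `schedule` are hypothetical. It may not match the stages drawn on the original slide.

```python
def block_accesses(K, L, lo, hi):
    """Exposed reads and writes of iterations lo..hi-1 run as one block."""
    written, exposed = set(), set()
    for i in range(lo, hi):
        if K[i] not in written:
            exposed.add(K[i])      # read of a value produced outside the block
        written.add(L[i])
    return exposed, written


def schedule(K, L, p, redistribute):
    """Return the iteration range committed by each speculative stage."""
    n = len(K)
    size = max(1, -(-n // p))
    blocks = [(s, min(s + size, n)) for s in range(0, n, size)]
    stages = []
    while blocks:
        written_below, good = set(), 0
        for lo, hi in blocks:
            exposed, written = block_accesses(K, L, lo, hi)
            if exposed & written_below:
                break              # flow dependence from a lower block
            written_below |= written
            good += 1
        end = blocks[good - 1][1]
        stages.append((blocks[0][0], end))
        blocks = blocks[good:]
        if redistribute and blocks:
            # Spread the remaining iterations over all p processors again.
            size = -(-(n - end) // p)
            blocks = [(s, min(s + size, n)) for s in range(end, n, size)]
    return stages


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
fixed = schedule(K, L, p=4, redistribute=False)
redistributed = schedule(K, L, p=4, redistribute=True)
print(fixed, redistributed)
```

In this simulation the redistributed run needs one more stage than the fixed-block run. Splitting iterations 5 and 6 across processors exposes the dependence through A(2) that was previously internal to one block. Each redistributed stage executes shorter blocks, however, so stage count alone does not determine which variant is faster.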
12. Redistribution Model
- Redistribution may not always be beneficial.
- Stop redistribution if
  - The cost of data redistribution outweighs the benefit from work redistribution.
- Synthetic loop to model this adaptive method.
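The stopping rule can be written as a simple predicate. The cost model below is purely hypothetical and is not the model used by the authors. It assumes a uniform per-iteration work cost, a uniform extra per-iteration cost for data moved after redistribution, and a stage time equal to the largest block. The function and parameter names are invented for illustration.

```python
from math import ceil


def should_redistribute(remaining_iters, active_procs, total_procs,
                        work_per_iter, move_cost_per_iter):
    """Redistribute only if the estimated stage time gets shorter."""
    # Without redistribution only the processors that still hold
    # uncommitted iterations do any work.
    time_without = ceil(remaining_iters / active_procs) * work_per_iter
    # With redistribution all processors work, but each iteration also
    # pays for fetching data that now lives in another processor's cache.
    time_with = ceil(remaining_iters / total_procs) * (
        work_per_iter + move_cost_per_iter)
    return time_with < time_without


print(should_redistribute(1000, 2, 8, 1.0, 0.5))
print(should_redistribute(8, 7, 8, 1.0, 2.0))
```

Under these assumptions redistribution pays off when few processors remain active and data movement is cheap, and it stops paying off when most processors are still busy or the moved data is expensive to fetch.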
13. Redistribution Model
14. Sliding Window R-LRPD
- R-LRPD can generate a sequential schedule for long dependence distributions.
- Strip-mine the speculative execution.
- Apply the R-LRPD on a contiguous block of iterations.
- Only dependences within the window cause failures.
- Adds more global synchronizations and test overhead.
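A schedule-only sketch of the sliding window follows. The assumptions are: a window holds `p` blocks of `block` iterations each, the window restarts at the first uncommitted iteration, only read and write sets are tracked, and the function name `sliding_window_schedule` is hypothetical.

```python
def sliding_window_schedule(K, L, p, block):
    """Return the iteration range committed by each window stage."""
    n, start, stages = len(K), 0, []
    while start < n:
        window_end = min(start + p * block, n)
        written_below, commit = set(), start
        for lo in range(start, window_end, block):
            hi = min(lo + block, window_end)
            written, exposed = set(), set()
            for i in range(lo, hi):
                if K[i] not in written:
                    exposed.add(K[i])
                written.add(L[i])
            # Only writes made earlier in this window can cause a failure;
            # everything before the window is already committed.
            if exposed & written_below:
                break
            written_below |= written
            commit = hi
        stages.append((start, commit))
        start = commit             # slide the window to the first failure
    return stages


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
windows = sliding_window_schedule(K, L, p=2, block=1)
print(windows)
```

In this run the dependence from iteration 4 to iteration 5 through A(4) never causes a failure because the two iterations fall in different windows. The price is one global synchronization and one test per window.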
15. DDG Extraction
- R-LRPD can generate sequential schedules for complex dependence distributions.
- Use the SW R-LRPD scheme to extract the data dependence graph (DDG).
- Generate an optimized schedule from the DDG.
- Obtains the DDG for loops from which a proper inspector cannot be extracted.
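Once per-iteration reads and writes are known, building a DDG and deriving a schedule from it can be sketched as below. The assumptions are: the access trace is taken directly from the index arrays rather than from a sliding-window run, and flow, anti and output dependences are all kept, which is conservative if arrays are privatized. The "optimized schedule" is taken to be a wavefront (level) schedule. The names `extract_ddg` and `wavefronts` are hypothetical.

```python
from collections import defaultdict


def extract_ddg(K, L):
    """Edges (src, dst) between iterations; iteration i reads K[i], writes L[i]."""
    last_writer = {}
    readers = defaultdict(set)     # readers of each element since its last write
    edges = set()
    for i in range(len(K)):
        r, w = K[i], L[i]
        if r in last_writer:
            edges.add((last_writer[r], i))         # flow dependence
        readers[r].add(i)
        for j in readers[w]:
            if j != i:
                edges.add((j, i))                  # anti dependence
        if w in last_writer:
            edges.add((last_writer[w], i))         # output dependence
        last_writer[w] = i
        readers[w] = set()
    return edges


def wavefronts(n, edges):
    """Group iterations into levels; each level can run fully in parallel."""
    level = [0] * n
    # Every edge points forward, so visiting edges by destination makes
    # each source level final before it is used.
    for src, dst in sorted(edges, key=lambda e: e[1]):
        level[dst] = max(level[dst], level[src] + 1)
    fronts = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        fronts[lv].append(i)
    return fronts


K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
fronts = wavefronts(len(K), extract_ddg(K, L))
print(fronts)                      # iterations are numbered from 0 here
```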
16. Performance Issues
- Performance issues
  - Blocked scheduling is a potential cause of load imbalance.
  - Checkpointing can be expensive.
- Feedback-guided blocked scheduling
  - Use the timing information from the previous instantiation (Bull, Euro-Par 98).
  - Estimate the processor chunk sizes for minimal load imbalance.
- On-Demand Checkpointing
  - Checkpoint only data modified during execution.
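Feedback-guided block sizing can be sketched as follows. This is a simplified reading, not the cited algorithm verbatim: the cost of an iteration is assumed constant within each block of the previous instantiation, and new boundaries are placed where the estimated cumulative time is closest to equal shares. The function name `feedback_guided_bounds` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def feedback_guided_bounds(bounds, times, p):
    """Compute new block boundaries from the previous run's block times.

    bounds -- previous boundaries, e.g. [0, 4, 8] for blocks [0,4) and [4,8)
    times  -- measured execution time of each previous block
    p      -- number of processors for the next instantiation
    """
    cost = []
    for lo, hi, t in zip(bounds, bounds[1:], times):
        if hi > lo:
            cost += [t / (hi - lo)] * (hi - lo)    # uniform cost inside a block
    prefix = [0.0] + list(accumulate(cost))        # time of the first j iterations
    target = prefix[-1] / p                        # ideal time per processor
    new_bounds = [bounds[0]]
    for k in range(1, p):
        j = bisect_left(prefix, k * target)        # first cut reaching the share
        # Prefer the previous cut if it is at least as close to the share.
        if j > 0 and k * target - prefix[j - 1] <= prefix[j] - k * target:
            j -= 1
        new_bounds.append(max(bounds[0] + j, new_bounds[-1]))
    new_bounds.append(bounds[-1])
    return new_bounds


# The second block took three times as long as the first one.
new_bounds = feedback_guided_bounds([0, 4, 8], [4.0, 12.0], p=2)
print(new_bounds)
```

In this example the first processor receives five cheap-to-moderate iterations and the second receives three expensive ones, which brings the estimated block times to 7 and 9 instead of 4 and 12.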
17. Experiments
- Setup
- 16 processor HP V-Class
- 4 GB memory
- HP-UX 11.0
18. Experimental Results: Input Profiles
19. Experimental Results: TRACK
20. Experimental Results: TRACK
21. Experimental Results: TRACK
22. Experimental Results: TRACK
23. Experimental Results: Sliding Window
24. Experimental Results: Sliding Window
25. Experimental Results: FMA3D
26. Experimental Results: SPICE 2G6
27. Conclusion
- Contribution
  - Can speculatively parallelize any loop.
  - Concern is now optimizing the parallelization and not when to parallelize.
- Future work
  - Use dependence distribution information for adaptive redistribution and scheduling.