Title: P3DE: Profile-Directed Predicated Partial Dead Code Elimination
1 P3DE: Profile-Directed Predicated Partial Dead Code Elimination
- Shane Ryoo, Sain-Zee Ueng, and Wen-mei W. Hwu
- Coordinated Science Laboratory
- University of Illinois at Urbana-Champaign
- EPIC-5 Workshop, March 26th, 2006
2 Motivation
- Even with classical optimizations, a significant amount of executed code is still dead [Butts, ASPLOS '02]
  - Our experience: a large amount of dead stores
- Contemporary architectures can generally issue only one or two stores per cycle, which lengthens the schedule in store-laden regions
- Can we remove more of this dead code?
- Push assignments off hot paths, towards uses
3 Partial Dead Code Elimination
- Partial Dead Code Elimination (PDE) reduces the execution of assignments whose results sometimes have no effect
- The basic algorithm by Knoop et al. [PLDI '94] sinks assignments to remove partially dead code
  - An assignment moves downward until it is blocked
[Figure: (a) Before PDE; (b) After PDE]
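The before/after figure can be approximated by a small Python sketch. It is not taken from the slides: the functions `run_before` and `run_after`, the input tuples, and the execution counter are illustrative assumptions. The assignment to `x` is read on only one path, so sinking it into that path keeps the results while executing it less often.

```python
def run_before(inputs):
    """Before PDE: x is assigned on every iteration, even when it is never read."""
    results, executed = [], 0
    for a, b, use_x in inputs:
        x = a + b                  # partially dead: only one successor path reads x
        executed += 1
        if use_x:
            results.append(x * 2)  # x is live on this path
        else:
            results.append(0)      # x is dead on this path
    return results, executed


def run_after(inputs):
    """After PDE: the assignment has been sunk into the only path that reads x."""
    results, executed = [], 0
    for a, b, use_x in inputs:
        if use_x:
            x = a + b              # sunk assignment, executed only where needed
            executed += 1
            results.append(x * 2)
        else:
            results.append(0)
    return results, executed


inputs = [(1, 2, True), (3, 4, False), (5, 6, False)]
print(run_before(inputs))  # same results, assignment executed 3 times
print(run_after(inputs))   # same results, assignment executed 1 time
```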
4 Previous Aggressive Dead Code Removal Techniques
- The primary weakness of previous methods is their limited handling of cyclic code regions
- Gupta et al. [PACT '97] present the only previous work on predicated PDE
  - Path-profile-based method for cost-benefit analysis
  - Cannot sink out of loops unless the assignment is dead along the backedge
- Bodik et al. [PLDI '97] use code restructuring to expose opportunities without introducing extra dynamic ops
  - Requires some method to control code growth
  - Cannot handle embedded control flow in a loop
5 Profile-Directed Predicated Partial Dead Code Elimination
- Use edge profile information to specialize program paths beyond basic PDE (essentially speculate)
  - Reduce the number of executions, based on the profile
- Use predication support to enable aggressive sinking of assignments
- Uniform cost-benefit model for cyclic and acyclic code regions that accounts for predication overhead
- Other optimizations reduce or eliminate predicate usage and increase the applicability of the optimization [Ryoo, M.S. thesis '04]
6 P3DE Example
[Figure: (a) Before P3DE; (b) After P3DE. Labels: computation of interest, side entry, new location]
7 P3DE Algorithm
- Perform dataflow analyses to determine the possible range of motion for all assignments
  - Dead and partially dead
  - Partially delayable
- For each assignment
  - Construct a motion graph representing this range
  - Compute a minimum cut of the motion graph, based on profile weights, to find the smallest number of executions
  - Insert new computations, delete old computations, and use predication as necessary to maintain correctness
- Iterate until no profitable motions remain
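The minimum-cut step can be sketched as follows. The slides do not say which min-cut algorithm is used; this sketch assumes a plain Edmonds-Karp max-flow, and the node names, the `source`/`sink` labels, and the profile weights are all hypothetical. Edge weights stand for execution counts, so the weight of the cut is the number of times the moved assignment would execute.

```python
from collections import defaultdict, deque

INF = 10**9  # stands in for "this edge may never be cut"


def min_cut(edges, source, sink):
    """Return (cut_weight, cut_edges) for a directed graph {(u, v): weight}."""
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), w in edges.items():
        residual[u][v] += w
        residual[v][u] += 0          # make sure the reverse entry exists

    def bfs_parents():
        """Breadth-first search over edges with remaining capacity."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    total = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            break                    # no augmenting path left: flow is maximal
        bottleneck, v = float("inf"), sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

    reachable = set(parents)         # source side of the cut
    cut = sorted((u, v) for (u, v) in edges
                 if u in reachable and v not in reachable)
    return total, cut


# Hypothetical motion graph: the assignment sits in A, which executes 100 times.
edges = {
    ("source", "A"): 100,  # cost of leaving the assignment where it is
    ("A", "B"): 30,
    ("B", "D"): 10,        # D reads the value
    ("B", "E"): 5,         # E reads the value
    ("D", "sink"): INF,
    ("E", "sink"): INF,
}
weight, cut = min_cut(edges, "source", "sink")
print(weight, cut)  # 15 [('B', 'D'), ('B', 'E')]
```

In this example, inserting the assignment on the two cut edges would execute it 15 times instead of 100, under the stated weights.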
8 Dead Assignment Dataflow
- Is the assignment dead along ALL/ANY future paths?
  - If completely live (or blocked by aliasing operations), sinking the assignment cannot result in fewer executions
  - If completely dead, the assignment does not need to be executed (inserted)
[Figure: dataflow direction]
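A minimal sketch of the ALL/ANY question as a backward dataflow over a tiny CFG. The block names, the per-block `effect` labels ("use" reads the variable first, "def" overwrites it first, `None` is transparent), and the assumption that the variable is dead at program exit are all illustrative and not from the slides. Values are reported at block entry; A is assumed to hold the assignment as its last operation, so its value describes the point just after the assignment.

```python
def dead_dataflow(succ, effect):
    """Backward analysis: is the assigned value dead along all / any future paths?"""
    blocks = list(succ)
    dead_all = {b: True for b in blocks}    # optimistic start for the ALL-paths problem
    dead_any = {b: False for b in blocks}   # pessimistic start for the ANY-path problem
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if effect[b] == "use":          # value is read: live here
                new_all, new_any = False, False
            elif effect[b] == "def":        # value is overwritten: dead here
                new_all, new_any = True, True
            else:                           # transparent: combine successors
                outs = succ[b]
                new_all = all(dead_all[s] for s in outs)
                new_any = any(dead_any[s] for s in outs) if outs else True
            if (new_all, new_any) != (dead_all[b], dead_any[b]):
                dead_all[b], dead_any[b] = new_all, new_any
                changed = True
    return dead_all, dead_any


# A branches to B (reads x) and C (overwrites x); both fall into exit block D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
effect = {"A": None, "B": "use", "C": "def", "D": None}
dead_all, dead_any = dead_dataflow(succ, effect)
print(dead_all["A"], dead_any["A"])  # False True: partially dead after A
```

Dead along some path but not along all paths is the partially dead case that sinking can profit from.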
9 Partial Delay Dataflow
- Can the assignment be sunk down any path with potential profit?
  - Filter out assignments that are live (not profitable to sink) when passing through blocks
[Figure: dataflow direction]
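One way to read this slide is as a forward "any path" dataflow. The sketch below is an interpretation, not the authors' formulation: `blocking` (blocks that read or overwrite the variable or its operands) and `profitable_exit` (whether the value is still dead along some path when leaving the block) are hypothetical inputs, and the origin block is held fixed.

```python
def partial_delay(succ, origin, blocking, profitable_exit):
    """Forward analysis: can the assignment reach each block along any path?"""
    preds = {b: [] for b in succ}
    for b, outs in succ.items():
        for s in outs:
            preds[s].append(b)

    delay_in = {b: False for b in succ}
    delay_out = {b: False for b in succ}
    delay_out[origin] = profitable_exit[origin]   # motion starts below the origin

    changed = True
    while changed:
        changed = False
        for b in succ:
            if b == origin:
                continue
            # Delayable at entry if any predecessor lets the assignment through.
            new_in = any(delay_out[p] for p in preds[b])
            # It passes through b only if b does not block it and sinking
            # further can still pay off (the filter described on the slide).
            new_out = new_in and b not in blocking and profitable_exit[b]
            if (new_in, new_out) != (delay_in[b], delay_out[b]):
                delay_in[b], delay_out[b] = new_in, new_out
                changed = True
    return delay_in, delay_out


# A holds the assignment; C and D read x, E overwrites it, B is transparent.
succ = {"A": ["B", "C"], "B": ["D", "E"], "C": [], "D": [], "E": []}
blocking = {"C", "D", "E"}
profitable_exit = {"A": True, "B": True, "C": False, "D": False, "E": False}
delay_in, delay_out = partial_delay(succ, "A", blocking, profitable_exit)
print(delay_out)  # only A and B let the assignment sink past them
```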
10 Motion Graph Construction
- One graph per assignment
- The motion graph includes every CFG edge for which the assignment is
  - partially delayable at the origin
  - not dead at the destination
11 Motion Graph Completion
- Create a single-source, single-sink graph
- Create cost edges to account for predication overhead (side entries)
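Construction and completion can be sketched together. This is a hedged sketch: the edge filter follows the two conditions on the construction slide, but the way the source and sink are attached (a source edge weighted by the origin block's execution count, and uncuttable edges from blocking destinations to the sink) is an assumption, and cost edges are left out because the slides do not give their construction. All names and profile weights are hypothetical.

```python
INF = 10**9  # stands in for an edge that must not be cut


def build_motion_graph(cfg_edges, origin, origin_count, delay_out, dead_in, blocked):
    """Build a single-source, single-sink motion graph for one assignment."""
    # Assumed modeling: cutting this edge means leaving the assignment in place.
    graph = {("source", origin): origin_count}
    for (u, v), weight in cfg_edges.items():
        # Keep an edge only if the assignment is partially delayable at its
        # origin and not dead at its destination.
        if delay_out[u] and not dead_in[v]:
            graph[(u, v)] = weight
            if v in blocked:
                graph[(v, "sink")] = INF   # motion cannot continue past v
    return graph


# Hypothetical edge profile: A executes 100 times and holds the assignment.
cfg_edges = {("A", "B"): 30, ("A", "C"): 70, ("B", "D"): 10, ("B", "E"): 20}
delay_out = {"A": True, "B": True, "C": False, "D": False, "E": False}
dead_in = {"A": False, "B": False, "C": True, "D": False, "E": True}
blocked = {"D"}   # D reads the assigned value

graph = build_motion_graph(cfg_edges, "A", 100, delay_out, dead_in, blocked)
for edge, weight in graph.items():
    print(edge, weight)
```

Edges into C and E drop out because the value is dead there, so the only finite cuts in this example are at the origin (100), on A to B (30), or on B to D (10).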
12 Code Motion
- Remove the original computation and set the predicate
- Clear the predicate at side entries
- Insert the new computation in the cut edge's control-equivalent block; execute it only when the predicate is set
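The three code-motion steps can be simulated in Python. This is an illustrative sketch only: the path names ("hot", "cold", "side"), the functions, and the operation counter are assumptions, and the operands `a` and `b` are assumed not to change between the original and the new location.

```python
def run_original(path, a, b, x_side):
    """Original code: the assignment runs on both paths through its block."""
    ops = 0
    x = x_side                    # value of x arriving through the side entry
    if path in ("hot", "cold"):
        x = a + b                 # computation of interest
        ops += 1
        if path == "hot":
            return None, ops      # hot path exits without reading x
    return x * 2, ops             # use of x (reached by "cold" and "side")


def run_p3de(path, a, b, x_side):
    """After the motion: the assignment is guarded by predicate p at the use."""
    ops = 0
    x = x_side
    if path in ("hot", "cold"):
        p = True                  # original computation removed, predicate set
        if path == "hot":
            return None, ops      # no assignment executed on the hot path
    else:
        p = False                 # side entry clears the predicate
    if p:                         # new location, executes only when p is set
        x = a + b
        ops += 1
    return x * 2, ops


for path in ("hot", "cold", "side"):
    print(path, run_original(path, 3, 4, 10), run_p3de(path, 3, 4, 10))
```

Both versions produce the same values on every path; only the hot path differs in how many times the assignment executes.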
13 Cyclic P3DE Example
14 Cyclic P3DE Motion Graph
15 Comparison: Loop Variable Migration
- The primary benefit of P3DE comes from register promotion within a loop with aliasing function calls, when performed together with a similar speculative PRE operation [Bodik, Ph.D. thesis '99]
- Within IMPACT, this optimization is already performed to some degree by loop variable migration, which guards individual aliasing function calls with loads and stores of the variable
- However, P3DE combined with speculative PRE is a more systematic method of performing this optimization, as it guards aliasing regions
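The effect of guarding individual aliasing calls can be illustrated with a small simulation. This is not IMPACT's implementation: the `memory` dictionary, the `aliasing_call` function, and the load/store counters are hypothetical stand-ins for a memory-resident variable `g`, a callee that may touch it, and the dynamic operation counts.

```python
def aliasing_call(memory):
    """Hypothetical callee that may read and write g through memory."""
    memory["g"] += 100


def baseline(n, call_every):
    """No promotion: g is loaded and stored on every iteration."""
    memory = {"g": 0}
    loads = stores = 0
    for i in range(n):
        g = memory["g"]
        loads += 1
        g += i
        memory["g"] = g
        stores += 1
        if (i + 1) % call_every == 0:
            aliasing_call(memory)
    return memory["g"], loads, stores


def migrated(n, call_every):
    """g lives in a local 'register'; only the aliasing call is guarded."""
    memory = {"g": 0}
    loads = stores = 0
    g = memory["g"]               # load once before the loop
    loads += 1
    for i in range(n):
        g += i
        if (i + 1) % call_every == 0:
            memory["g"] = g       # store before the call so the callee sees g
            stores += 1
            aliasing_call(memory)
            g = memory["g"]       # reload after the call
            loads += 1
    memory["g"] = g               # store once after the loop
    stores += 1
    return memory["g"], loads, stores


print(baseline(10, 5))   # (245, 10, 10)
print(migrated(10, 5))   # (245, 3, 3)
```

When several aliasing calls sit close together, guarding the whole region instead of each call would save further loads and stores, which is the distinction the slide draws.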
16 Loop Variable Migration Example
17 Performance Evaluation
- Test machine: HP zx6000 workstation
  - Dual 1 GHz Itanium 2 processors
  - 8 GB RAM
- IMPACT compiler configuration
  - Andersen-style, context-sensitive, field-sensitive pointer analysis for memory disambiguation (incorporated after [Ryoo, M.S. thesis '04])
  - Traditional optimizations, loop optimizations (including loop variable migration), speculative PRE, and hyperblock and superblock formation
  - Baseline runs a sink-only-if-partially-delayable PDE prior to hyperblock and superblock formation
  - P3DE version replaces PDE in the optimization chain and does not run loop variable migration
- SPEC scores taken as the median of 5 runs
- Itanium performance counters used to measure stores and predicate write operations
18 Dynamic Store Operations Removed
[Chart: stores normalized to baseline input]
19 Predicate Writes Inserted Per Store Removed
Performed on an HP zx6000 with two 1 GHz Itanium 2 processors and 8 GB RAM
[Chart: predicate writes per store removed]
- Some benchmarks are omitted because relatively few stores were removed
- Many predicate write operations are created by hyperblock formation, which can affect the total number of predicate write operations
20 SPEC Performance Increase
[Chart: percentage increase over baseline performance]
21 Performance Analysis
- Many profitable cases are already subsumed by loop variable migration, so little effect is seen in many benchmarks
  - 255.vortex achieves a performance benefit
- Specific losses
  - 186.crafty
    - Micropipeline stalls: loads and stores clustered together may appear to have conflicts
    - Kernel cycles increase
  - 253.perlbmk: explicit spill stores due to register pressure
  - 254.gap: largest portion is branch misprediction
22 Conclusions
- P3DE, in combination with speculative PRE, is a more systematic version of loop variable migration
- P3DE improves performance when stores are moved out of the critical path of statically scheduled regions
- In a wide-issue architecture, instructions generally have significant scheduling freedom, so P3DE has little additional benefit on performance over loop variable migration, and can result in
  - Increased register pressure
  - Micropipeline stalls when moving stores towards loads
  - Perturbation of hyperblock formation
- An undue amount of predication is not introduced (due to cost edges)