Title: Register Pressure Guided Unroll-and-Jam
1 Register Pressure Guided Unroll-and-Jam
- Authors: Yin Ma and Steven Carr
2 Motivation
- In a processor, registers sit at the fastest
  level of the memory hierarchy, but the number
  of physical registers is very limited.
- Unroll-and-jam in the loop model of Open64 not
  only increases register pressure by itself but
  also creates opportunities for other loop
  optimizations to increase register pressure
  indirectly.
- If a transformed loop demands too many registers,
  the overall performance may degrade.
- Given a loop nest, with a better register
  pressure prediction and a well-chosen unroll
  factor, the degradation can be eliminated and
  better overall performance can be achieved.
3 Research Topic
- A register pressure prediction algorithm for
  unroll-and-jam
- A register pressure guided loop model for
  unroll-and-jam
4 Background: Data Dependence Analysis
True Dependence:    S1: L1 = ...    S2: ... = L2
Anti-Dependence:    S1: ... = L1    S2: L2 = ...
Output Dependence:  S1: L1 = ...    S2: L2 = ...
Input Dependence:   S1: ... = L1    S2: ... = L2
- The data dependence graph (DDG) is a directed
  graph that represents the data dependence
  relationships among instructions.
- A true dependence exists when L1 stores into a
  memory location that is read by L2 later.
- An anti-dependence exists if L1 is a read from a
  memory location that is written by L2 later.
- An output dependence exists when L1 and L2 store
  into the same memory location.
- An input dependence exists if a memory location
  is read by both L1 and L2.
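The four definitions above depend only on whether the earlier access (L1) and the later access (L2) to the same location are reads or writes. A minimal sketch of that classification follows; the function name and boolean parameters are illustrative assumptions, not part of the presentation.

```python
def classify_dependence(first_is_write: bool, second_is_write: bool) -> str:
    """Classify the dependence between two accesses to the same memory location.

    first_is_write describes the earlier access (L1),
    second_is_write describes the later access (L2).
    """
    if first_is_write and not second_is_write:
        return "true"    # L1 writes, L2 reads later
    if not first_is_write and second_is_write:
        return "anti"    # L1 reads, L2 writes later
    if first_is_write and second_is_write:
        return "output"  # both write the same location
    return "input"       # both read the same location


# Print the full 2x2 table of cases.
for w1 in (True, False):
    for w2 in (True, False):
        print(w1, w2, classify_dependence(w1, w2))
```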
5 Background: Scalar Replacement
- Uses scalars, later allocated to registers, to
  replace array references in order to decrease the
  number of memory references in loops
- This directly increases register pressure
for ( i = 2; i < n; i++ )
    a[i] = a[i-1] + b[i];

Scalar Replaced:
T = a[1];
for ( i = 2; i < n; i++ ) {
    T = T + b[i];
    a[i] = T;
}
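To make the trade-off concrete, here is a small sketch that runs a recurrence of this shape both ways and counts array reads. It assumes the loop body is `a[i] = a[i-1] + b[i]` (the operator is an assumption) and that the arrays have at least two elements; the function names and the load counter are hypothetical.

```python
def original_loop(a, b):
    """Recurrence a[i] = a[i-1] + b[i]; reads two array elements per iteration."""
    a = list(a)
    loads = 0
    for i in range(2, len(a)):
        a[i] = a[i - 1] + b[i]
        loads += 2  # a[i-1] and b[i]
    return a, loads


def scalar_replaced_loop(a, b):
    """Same recurrence with a[i-1] carried in the scalar t (a register candidate)."""
    a = list(a)
    t = a[1]        # one load before the loop
    loads = 1
    for i in range(2, len(a)):
        t = t + b[i]
        loads += 1  # only b[i] is read from memory
        a[i] = t
    return a, loads


a0 = [0, 1, 0, 0, 0, 0]
b0 = [0, 0, 1, 2, 3, 4]
print(original_loop(a0, b0))
print(scalar_replaced_loop(a0, b0))
```

Both versions produce the same array, but the scalar-replaced one keeps `t` live across the whole loop, which is the extra register the slide refers to.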
6 Background: Unroll-and-Jam
- Creates larger loop bodies by flattening multiple
  iterations
- Larger loop bodies make other optimizations
  create more register pressure
Original loop nest:
for ( I = 1; I <= 10; I++ )
    for ( J = 1; J <= 5; J++ ) {
        A[I][J] = B[J] + C[J];
        D[I][J] = E[I][J] + F[I][J];
    }

Unroll-and-jammed and later scalar replaced:
for ( I = 1; I <= 10; I = I+2 )
    for ( J = 1; J <= 5; J++ ) {
        b = B[J];
        c = C[J];
        A[I][J] = b + c;
        D[I][J] = E[I][J] + F[I][J];
        A[I+1][J] = b + c;
        D[I+1][J] = E[I+1][J] + F[I+1][J];
        /* register pressure increased because b and c
           hold two registers that originally could be
           reused for E and F */
    }
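The transformation can be checked on small arrays. The sketch below assumes the loop bodies combine their operands with `+` (the operators are an assumption), uses 0-based indices, and requires an even outer trip count; all function names are hypothetical.

```python
def original_nest(B, C, E, F, n_i, n_j):
    """Reference loop nest: A[i][j] = B[j] + C[j], D[i][j] = E[i][j] + F[i][j]."""
    A = [[0] * n_j for _ in range(n_i)]
    D = [[0] * n_j for _ in range(n_i)]
    for i in range(n_i):
        for j in range(n_j):
            A[i][j] = B[j] + C[j]
            D[i][j] = E[i][j] + F[i][j]
    return A, D


def unroll_and_jam_by_2(B, C, E, F, n_i, n_j):
    """Outer loop unrolled by 2 and jammed; B[j] and C[j] are held in scalars."""
    assert n_i % 2 == 0, "sketch assumes an even outer trip count"
    A = [[0] * n_j for _ in range(n_i)]
    D = [[0] * n_j for _ in range(n_i)]
    for i in range(0, n_i, 2):
        for j in range(n_j):
            b = B[j]  # live across both copies of the body
            c = C[j]
            A[i][j] = b + c
            D[i][j] = E[i][j] + F[i][j]
            A[i + 1][j] = b + c
            D[i + 1][j] = E[i + 1][j] + F[i + 1][j]
    return A, D


n_i, n_j = 10, 5
B = list(range(n_j))
C = [10 * j for j in range(n_j)]
E = [[i + j for j in range(n_j)] for i in range(n_i)]
F = [[i * j for j in range(n_j)] for i in range(n_i)]
print(original_nest(B, C, E, F, n_i, n_j) == unroll_and_jam_by_2(B, C, E, F, n_i, n_j))
```

The results match, while `b` and `c` stay live across both copies of the body, which is where the added register pressure comes from.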
7 Background: Software Pipelining
- Software pipelining is an advanced scheduling
  technique. Usually, more-overlapped instructions
  demand additional registers.
- The initiation interval (II) of a loop is the
  number of cycles between the starts of successive
  iterations.
- The resource II (ResII) gives the minimum number
  of cycles needed to execute the loop based upon
  machine resources such as the number of
  functional units.
- The recurrence II (RecII) gives the minimum
  number of cycles needed for a single iteration
  based upon the length of the cycles in the data
  dependence graph.
Original loop: Do N times
Software pipelined due to dependences among the
operations:
  Prelude:    D(1) B(1) D(2)
  Loop Body:  Do N-2 times (with index i)
              A(i) C(i) B(i+1) D(i+2)
  Postlude:   A(N-1) C(N-1) B(N) A(N) C(N)
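One property of the pipelined form that is easy to verify is that the prelude, the N-2 body iterations, and the postlude together execute each of the operations A, B, C, D exactly once per original iteration. The sketch below reads the schedule as prelude D(1) B(1) D(2), body A(i) C(i) B(i+1) D(i+2), postlude A(N-1) C(N-1) B(N) A(N) C(N); the function name is hypothetical and N >= 2 is assumed.

```python
def pipelined_ops(n):
    """List (operation, iteration) pairs in prelude, loop body, postlude order."""
    assert n >= 2, "the prelude and postlude already cover two iterations"
    ops = [("D", 1), ("B", 1), ("D", 2)]                   # prelude
    for i in range(1, n - 1):                               # i = 1 .. N-2
        ops += [("A", i), ("C", i), ("B", i + 1), ("D", i + 2)]
    ops += [("A", n - 1), ("C", n - 1), ("B", n), ("A", n), ("C", n)]  # postlude
    return ops


def covers_all_iterations(n):
    """True if every operation of every original iteration appears exactly once."""
    expected = sorted((op, k) for op in "ABCD" for k in range(1, n + 1))
    return sorted(pipelined_ops(n)) == expected


print(covers_all_iterations(10))
```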
8 Typical approaches to preventing degradation from
register pressure
- Predictive approach <- Our approach
  - Predict effects before applying optimizations and
    decide the best set of parameters for the
    optimizations
  - Fastest speed and fit for all situations
- Iterative approach (like feedback based)
  - Apply optimizations with one set of parameters,
    then redo with adjusted parameters for better
    performance
- Genetic approach
  - Prepare many sets of parameters and apply
    optimizations with each set. Use genetic
    programming to pick the best
9 Problems in Previous Work
- All existing register pressure prediction methods
  are designed for software pipelining.
- They do not support source-code-level loop
  optimizations at all
- No systematic research on how to predict register
  pressure for loop optimizations
- No register pressure guided loop model
10 Key Design Details
- Prediction algorithms work at the source-code level
- Prediction algorithms handle the effects on
  register pressure from
  - unroll-and-jam
  - scalar replacement
  - software pipelining
  - general scalar optimizations
- The register pressure guided loop model uses the
  predicted register information to pick an unroll
  vector for the best performance
11 Register Prediction for unroll-and-jam
(Overview)
- Compute RecII with our heuristic method
- Create the list of arrays that will be replaced
  by scalars by checking the original DDG
- Construct the new DDG D1 with the list above,
  only for the original loop
- All copies will reuse the DDG D1 as the base DDG
- Adjust each copy of the DDG to reflect the future
  changes.
- Re-compute the ResII to get MinII
- Do a pseudo schedule to get the register pressure
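The last step, reading a register pressure estimate off a pseudo schedule, can be illustrated with a common measure: the maximum number of values live at the same time when the loop repeats every II cycles. The sketch below is a generic stand-in, not the authors' algorithm; `max_live`, the lifetime tuples, and the example numbers are hypothetical.

```python
def max_live(lifetimes, ii):
    """Estimate register pressure for a loop that starts a new iteration every ii cycles.

    lifetimes is a list of (def_cycle, last_use_cycle) pairs from one iteration's
    schedule; a value occupies a register during cycles def_cycle .. last_use_cycle - 1.
    Lifetimes longer than ii overlap with later iterations, so cycles are folded
    modulo ii before counting.
    """
    live = [0] * ii
    for start, end in lifetimes:
        for cycle in range(start, end):
            live[cycle % ii] += 1
    return max(live)


lifetimes = [(0, 3), (1, 2), (2, 5)]
print(max_live(lifetimes, ii=2))   # tight II: iterations overlap heavily
print(max_live(lifetimes, ii=10))  # loose II: no overlap between iterations
```

A smaller II overlaps more iterations and raises the estimate, which matches the earlier remark that more-overlapped instructions demand additional registers.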
12 Construct the base DDG
- Traverse the innermost loop and construct
  the base DDG
DO J = 1, N
  DO I = 1, N
    U(I,J) = V(I) + P(J,I)
  ENDDO
ENDDO
13 Prepare the DDG after unroll-and-jam
- Duplicate the base DDG according to the input
  unroll factors
DO J = 1, N, 2
  DO I = 1, N
    U(I,J) = V(I) + P(J,I)
    U(I,J+1) = V(I) + P(J+1,I)
  ENDDO
ENDDO
Unroll vector is 2
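Duplicating the base DDG can be sketched with a plain node list and edge list. The representation below (string node names, copies tagged with an index k) and the base graph for a statement of the form U(I,J) = V(I) op P(J,I) are illustrative assumptions, not the data structures used in Open64.

```python
def duplicate_ddg(nodes, edges, unroll_factor):
    """Copy a base DDG unroll_factor times, tagging every node with its copy index."""
    new_nodes = [(name, k) for k in range(unroll_factor) for name in nodes]
    new_edges = [((src, k), (dst, k))
                 for k in range(unroll_factor)
                 for (src, dst) in edges]
    return new_nodes, new_edges


# Hypothetical base DDG for one statement: two loads feed an operation, then a store.
base_nodes = ["load V(I)", "load P(J,I)", "op", "store U(I,J)"]
base_edges = [("load V(I)", "op"), ("load P(J,I)", "op"), ("op", "store U(I,J)")]

nodes2, edges2 = duplicate_ddg(base_nodes, base_edges, unroll_factor=2)
print(len(nodes2), len(edges2))
```

The copies are still independent at this point; removing redundant nodes and adding cross-copy edges is a separate step.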
14 Finalize the DDG
- Remove unnecessary nodes/edges and add new edges
- Based on the updated dependence
- Reflect the effect of further optimizations
- Consider array indexing reuse by analyzing array
subscripts
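Subscript analysis can reveal that some references in different copies are the same memory location: a reference such as V(I) does not involve the unrolled index J, so every copy reads the same element, while P(J,I) becomes P(J,I) and P(J+1,I). The sketch below counts distinct loads under that simple rule; the tuple encoding of references and the function name are hypothetical, and real subscripts can be far more general.

```python
def distinct_loads(loads, unrolled_index, unroll_factor):
    """Return the set of distinct references after unrolling one loop index.

    loads is a list of (array_name, subscripts), where subscripts is a tuple of
    index variable names. In copy k the unrolled index is shifted by k; other
    indices are unchanged, so references that do not use it coincide.
    """
    seen = set()
    for k in range(unroll_factor):
        for array, subs in loads:
            shifted = tuple((s, k if s == unrolled_index else 0) for s in subs)
            seen.add((array, shifted))
    return seen


loads = [("V", ("I",)), ("P", ("J", "I"))]
refs = distinct_loads(loads, unrolled_index="J", unroll_factor=2)
print(len(refs))  # V(I) is shared by both copies; P(J,I) and P(J+1,I) differ
```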
15 Register Prediction
- Schedule the final DDG with a depth-first scan
  starting from the first node of the first
  iteration copy
- The RecII is the RecII of the original innermost
  loop
- The ResII is computed on the final DDG with the
  targeted architecture information
- MinII = MAX(RecII, ResII)
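The MinII computation can be sketched with the textbook formulations of the two bounds: ResII as the largest ratio of operations to functional units over all resource classes, and RecII as the largest ratio of latency to iteration distance over all dependence cycles. This is not the heuristic RecII method mentioned in the overview, and the machine description and cycle data below are made-up examples.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b


def res_ii(op_counts, units):
    """Resource bound: max over resource classes of ceil(operations / units)."""
    return max(ceil_div(op_counts[r], units[r]) for r in op_counts)


def rec_ii(cycles):
    """Recurrence bound: max over DDG cycles of ceil(total latency / total distance).

    cycles is a list of (total_latency, total_distance) pairs; returns 0 when the
    DDG has no recurrence.
    """
    return max((ceil_div(lat, dist) for lat, dist in cycles), default=0)


def min_ii(rec, res):
    """MinII = MAX(RecII, ResII)."""
    return max(rec, res)


# Hypothetical machine: 2 memory units, 1 floating-point unit.
res = res_ii({"mem": 5, "fp": 2}, {"mem": 2, "fp": 1})
rec = rec_ii([(4, 2)])
print(rec, res, min_ii(rec, res))
```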
16 Register Pressure Guided Unroll-and-Jam
- Use unitII as the performance indicator of an
  unroll-and-jammed loop
- The unitII formula uses the following quantities:
- R is the number of registers predicted
- P is the number of registers available
- D is the total outgoing degree in the final DDG
- E is the total number of cross iteration edges
- A is the average memory access penalty
- N is the number of nodes in the final DDG
17 Open64 Implementation and Experiment Results
- For register prediction, a retargetable compiler
  with an infinite number of available physical
  registers is used
- Loop nests are extracted from SPEC2000
- For register pressure guided unroll-and-jam, our
  model directly replaces the unroll-and-jam
  analysis used by the Open64 backend
- A minor value computed with the information from
  Open64's cache model is added to unitII
- Register prediction for unroll-and-jam predicts
  the floating-point register pressure of a loop
  within 3 registers and the integer register
  pressure within 4 registers
- Our register pressure guided unroll-and-jam also
  improves overall performance by about 2% over
  the model in the Open64 backend on both x86 and
  x86-64 architectures on the Polyhedron benchmark
18 The End
Any questions?