Register Pressure Guided Unroll-and-Jam - PowerPoint PPT Presentation

About This Presentation

Title:

Register Pressure Guided Unroll-and-Jam


Slides: 19

Provided by: yin95

Learn more at: https://www.capsl.udel.edu


Transcript and Presenter's Notes

Title: Register Pressure Guided Unroll-and-Jam

1
Register Pressure Guided Unroll-and-Jam

Authors: Yin Ma, Steven Carr

2
Motivation

Registers sit at the fastest level of a processor's
memory hierarchy, but the number of physical
registers is very limited.
Unroll-and-jam in the loop model of Open64 not
only increases register pressure by itself but
also creates opportunities for other loop
optimizations to increase register pressure
indirectly.
If a transformed loop demands too many registers,
overall performance may degrade.
Given a loop nest, better register pressure
prediction and a well-chosen unroll factor can
avoid this degradation and achieve better
overall performance.

3
Research Topic

A register pressure prediction algorithm for
unroll-and-jam
A register pressure guided loop model for
unroll-and-jam

4
Background: Data Dependence Analysis
True dependence:    S1: L1 = ...    S2: ... = L2
Anti-dependence:    S1: ... = L1    S2: L2 = ...
Output dependence:  S1: L1 = ...    S2: L2 = ...
Input dependence:   S1: ... = L1    S2: ... = L2

The data dependence graph (DDG) is a directed
graph that represents the data dependence
relationship among instructions.
A true dependence exists when L1 stores into a
memory location that is read by L2 later.
An anti-dependence exists if L1 is a read from a
memory location that is written by L2 later.
An output dependence exists when L1 and L2 store
into the same memory location.
An input dependence exists if a memory location
is read by L1 and L2.
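
The four definitions above depend only on whether each of the two accesses is a read or a write. A minimal sketch of that classification follows; the function name and the boolean encoding are illustrative assumptions, not part of the presentation.

```python
def classify_dependence(first_is_write: bool, second_is_write: bool) -> str:
    """Classify the dependence from an earlier access L1 to a later access L2.

    Both accesses are assumed to touch the same memory location.
    """
    if first_is_write and not second_is_write:
        return "true"    # L1 stores, L2 reads later
    if not first_is_write and second_is_write:
        return "anti"    # L1 reads, L2 stores later
    if first_is_write and second_is_write:
        return "output"  # both store
    return "input"       # both read
```

For example, a store followed by a load of the same location is classified as a true dependence.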

5
Background: Scalar Replacement

Replaces array references with scalars, which are
later allocated to registers, in order to decrease
the number of memory references in loops.
This directly increases register pressure.

Original:
  for (i = 2; i < n; i++)
    a[i] = a[i-1] + b[i];
Scalar replaced:
  T = a[1];
  for (i = 2; i < n; i++) {
    T = T + b[i];
    a[i] = T;
  }
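
To illustrate the saving, the sketch below runs both versions of the loop and counts array loads. It assumes the loop body is a[i] = a[i-1] + b[i] (the operator is not legible in the transcript, so + is an assumption), and the function names and load counters are hypothetical additions.

```python
def loop_original(a, b):
    """Original loop: two array loads (a[i-1] and b[i]) per iteration."""
    a = list(a)
    loads = 0
    for i in range(2, len(a)):
        a[i] = a[i - 1] + b[i]
        loads += 2
    return a, loads


def loop_scalar_replaced(a, b):
    """Scalar-replaced loop: the scalar t carries a[i-1] across iterations."""
    a = list(a)
    t = a[1]          # one load before the loop
    loads = 1
    for i in range(2, len(a)):
        t = t + b[i]  # only b[i] is loaded from memory
        loads += 1
        a[i] = t
    return a, loads
```

Both functions produce the same array, but the scalar-replaced version issues roughly half as many loads, at the cost of keeping t live in a register for the whole loop.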
6
Background: Unroll-and-Jam

Creates larger loop bodies by unrolling an outer
loop and fusing (jamming) the resulting copies of
the inner loop.
Larger loop bodies cause other optimizations to
create more register pressure.

Original:
  for (I = 1; I <= 10; I++)
    for (J = 1; J <= 5; J++) {
      A[I][J] = B[J] + C[J];
      D[I][J] = E[I][J] + F[I][J];
    }
Unroll-and-jammed and later scalar replaced:
  for (I = 1; I <= 10; I = I + 2)
    for (J = 1; J <= 5; J++) {
      b = B[J]; c = C[J];
      A[I][J] = b + c;
      D[I][J] = E[I][J] + F[I][J];
      A[I+1][J] = b + c;
      D[I+1][J] = E[I+1][J] + F[I+1][J];
      /* register pressure increased because b and c
         hold two registers that originally could be
         reused for E and F */
    }
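
As a sanity check on the transformation, the sketch below executes both loop nests and compares results. It assumes inclusive bounds (I from 1 to 10, J from 1 to 5) and + as the operator in both statements; neither is legible in the transcript, and the function names and load counters are hypothetical.

```python
def nest_original(B, C, E, F, n_i=10, n_j=5):
    """Original nest: B[J] and C[J] are loaded in every (I, J) iteration."""
    A = [[0] * (n_j + 1) for _ in range(n_i + 1)]
    D = [[0] * (n_j + 1) for _ in range(n_i + 1)]
    bc_loads = 0
    for i in range(1, n_i + 1):
        for j in range(1, n_j + 1):
            A[i][j] = B[j] + C[j]
            bc_loads += 2
            D[i][j] = E[i][j] + F[i][j]
    return A, D, bc_loads


def nest_unroll_and_jam(B, C, E, F, n_i=10, n_j=5):
    """Outer loop unrolled by 2 and jammed; n_i is assumed to be even."""
    A = [[0] * (n_j + 1) for _ in range(n_i + 1)]
    D = [[0] * (n_j + 1) for _ in range(n_i + 1)]
    bc_loads = 0
    for i in range(1, n_i + 1, 2):
        for j in range(1, n_j + 1):
            b, c = B[j], C[j]  # loaded once, shared by both copies
            bc_loads += 2
            A[i][j] = b + c
            D[i][j] = E[i][j] + F[i][j]
            A[i + 1][j] = b + c
            D[i + 1][j] = E[i + 1][j] + F[i + 1][j]
    return A, D, bc_loads
```

The results match while the loads of B and C are halved; the price is that b and c stay live across both copies of the body.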
7
Background: Software Pipelining

Software pipelining is an advanced scheduling
technique that overlaps instructions from different
iterations. Usually, more overlap demands
additional registers.
The initiation interval (II) of a loop is the
number of cycles between the starts of successive
iterations.
The resource II (ResII) gives the minimum number
of cycles needed to execute the loop based upon
machine resources such as the number of
functional units.
The recurrence II (RecII) gives the minimum
number of cycles needed for a single iteration
based upon the length of the cycles in the data
dependence graph.
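
The two bounds can be sketched with the textbook formulas from modulo scheduling. This is not the heuristic RecII method used in the presentation; the input encodings (operation counts per resource class, and a latency/distance pair per dependence cycle) are assumptions made for illustration.

```python
from math import ceil


def res_ii(op_counts, unit_counts):
    """Resource bound: the busiest resource class determines the minimum II.

    op_counts[k] is the number of operations needing resource class k;
    unit_counts[k] is the number of functional units of that class.
    """
    return max(ceil(op_counts[k] / unit_counts[k]) for k in op_counts)


def rec_ii(cycles):
    """Recurrence bound over dependence cycles.

    Each cycle is (total latency, total iteration distance); a cycle of
    latency 6 spanning 2 iterations needs at least 3 cycles per iteration.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=0)
```

For instance, three memory operations on two memory units give a resource bound of 2 cycles.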

Original loop: Do N times
Software pipelined according to the dependences
among the operations:
[Prelude]   D1 B1 D2
[Loop Body] Do N-2 times (with index i): Ai Ci Bi+1 Di+2
[Postlude]  AN-1 CN-1 BN AN CN
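
One way to check the prelude, loop body, and postlude on this slide is to expand them for a concrete N and confirm that each of the operations A, B, C, and D still runs once per original iteration. The sketch below does that; the function name and the tuple encoding are hypothetical.

```python
def pipelined_order(n):
    """Expand the software-pipelined loop for n >= 2 original iterations.

    Returns a list of (operation, iteration) pairs in execution order.
    """
    seq = [("D", 1), ("B", 1), ("D", 2)]  # prelude
    for i in range(1, n - 1):             # loop body, executed n-2 times
        seq += [("A", i), ("C", i), ("B", i + 1), ("D", i + 2)]
    # postlude drains the remaining operations of the last two iterations
    seq += [("A", n - 1), ("C", n - 1), ("B", n), ("A", n), ("C", n)]
    return seq
```

In the steady state, one pass through the loop body mixes operations from three different original iterations, which is why overlapped schedules tend to keep more values live at once.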
8
Typical approaches to preventing degradation from
register pressure

Predictive approach <- our approach
Predict the effects before applying optimizations
and decide the best set of parameters for the
optimizations.
Fastest of the three approaches and fits all situations.
Iterative approach (e.g., feedback-based)
Apply optimizations with one set of parameters,
then redo them with adjusted parameters for
better performance.
Genetic approach
Prepare many sets of parameters and apply
optimizations with each set. Use genetic
programming to pick the best.

9
Problem in Previous Work

All existing predictive register pressure methods
are designed for software pipelining.
They do not support source-code-level loop
optimizations at all.
No systematic research on how to predict register
pressure for loop optimizations
No register pressure guided loop model

10
Key Design Detail

The prediction algorithm works at the source-code level
The prediction algorithm handles the effects on
register pressure from:
  unroll-and-jam
  scalar replacement
  software pipelining
  general scalar optimizations
The register pressure guided loop model uses the
predicted register information to pick an unroll
vector for the best performance

11
Register Prediction for Unroll-and-Jam
(Overview)

Compute RecII with our heuristic method
Create the list of array references that will be
replaced by scalars by checking the original DDG
Construct the new DDG D1 with the list above,
only for the original loop
All copies reuse the DDG D1 as their base DDG
Adjust the DDG of each copy to reflect the future
changes
Re-compute the ResII to get MinII
Do a pseudo schedule to get the register pressure
12
Construct the base DDG

Travel through the innermost loop and construct
the base DDG

DO J = 1, N
  DO I = 1, N
    U(I,J) = V(I) + P(J,I)
  ENDDO
ENDDO
13
Prepare the DDG after unroll-and-jam

Duplicate the base DDG according to the given
unroll factors

DO J = 1, N, 2
  DO I = 1, N
    U(I,J) = V(I) + P(J,I)
    U(I,J+1) = V(I) + P(J+1,I)
  ENDDO
ENDDO
Unroll vector is 2
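
The duplication step can be pictured as tagging every node and edge of the base DDG with a copy index. The sketch below is a minimal illustration only: the node names, the edge-list representation, and the assumed + operation in the loop body are hypothetical and not taken from the Open64 implementation.

```python
# Hypothetical base DDG for one statement of the innermost loop:
# two loads feed an add, which feeds the store to U.
BASE_NODES = ["load V(I)", "load P", "add", "store U"]
BASE_EDGES = [("load V(I)", "add"), ("load P", "add"), ("add", "store U")]


def duplicate_ddg(nodes, edges, factor):
    """Make `factor` copies of the base DDG, one per unrolled iteration.

    Each node becomes (name, copy_index); edges stay within their copy.
    """
    new_nodes = [(name, k) for k in range(factor) for name in nodes]
    new_edges = [((s, k), (d, k)) for k in range(factor) for s, d in edges]
    return new_nodes, new_edges
```

With an unroll factor of 2 this yields 8 nodes and 6 edges, which a later step can adjust to reflect sharing between copies.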
14
Finalize the DDG

Remove unnecessary nodes/edges and add new edges
based on the updated dependences
Reflect the effect of further optimizations
Consider array index reuse by analyzing array
subscripts
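
One case of an unnecessary node is a load that is identical in every unrolled copy, such as V(I) in the running example, which scalar replacement would later keep in a single register. The sketch below merges such nodes onto copy 0. It is an illustration under assumed data structures, with nodes encoded as (name, copy_index) pairs, and does not reproduce the presentation's actual adjustment rules.

```python
def merge_invariant_loads(nodes, edges, invariant):
    """Collapse all copies of the nodes named in `invariant` onto copy 0.

    nodes: list of (name, copy_index); edges: list of (src, dst) node pairs.
    Edges are redirected to the surviving node and duplicates are dropped.
    """
    def canon(node):
        name, _ = node
        return (name, 0) if name in invariant else node

    merged_nodes = []
    for node in nodes:
        c = canon(node)
        if c not in merged_nodes:
            merged_nodes.append(c)
    merged_edges = []
    for src, dst in edges:
        e = (canon(src), canon(dst))
        if e not in merged_edges:
            merged_edges.append(e)
    return merged_nodes, merged_edges
```

After merging, the single remaining load feeds the add in both copies, so its value has a longer lifetime than in the base DDG.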

15
Register Prediction

Schedule the final DDG with a depth-first scan
starting from the first node of the first
iteration copy
The RecII is the RecII of the original innermost
loop
The ResII is computed on the final DDG with the
targeted architecture information
MinII = MAX(RecII, ResII)
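
Once a schedule assigns each value a definition cycle and a last-use cycle, register pressure can be estimated as the largest number of simultaneously live values. The sketch below shows MinII and that generic count on a straight-line schedule; it is not the depth-first pseudo-scheduler described on this slide, and the lifetime encoding is an assumption.

```python
def min_ii(rec_ii, res_ii):
    """MinII is the larger of the recurrence and resource bounds."""
    return max(rec_ii, res_ii)


def max_live(lifetimes):
    """Peak number of overlapping lifetimes.

    lifetimes: list of (start, end) cycles; a value is live for
    start <= t < end, so a register freed at cycle t can be reused at t.
    """
    events = []
    for start, end in lifetimes:
        events.append((start, 1))
        events.append((end, -1))
    live = peak = 0
    # Tuples sort by cycle first; at equal cycles -1 sorts before +1,
    # so releases are processed before new definitions.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak
```

For example, three values with lifetimes (0, 3), (1, 4), and (3, 5) need two registers, because the first value dies exactly when the third is defined.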

16
Register Pressure Guided Unroll-and-Jam

Use unitII as the performance indicator of an
unroll-and-jammed loop
R is the number of registers predicted
P is the number of registers available
D is the total outgoing degree in the final DDG
E is the total number of cross iteration edges
A is the average memory access penalty
N is the number of nodes in the final DDG

17
Open64 Implementation and Experimental Results

For register prediction, a retargetable compiler
with an infinite number of available physical
registers is used
Loop nests are extracted from SPEC2000
For register pressure guided unroll-and-jam, our
model directly replaces the unroll-and-jam
analysis used by the Open64 backend
A minor value computed with information from
Open64's cache model is added to unitII
Our register prediction for unroll-and-jam
predicts the floating-point register pressure of
a loop within 3 registers and the integer register
pressure within 4 registers
Our register pressure guided unroll-and-jam
improves overall performance by about 2% over
the model in the Open64 backend on both the x86 and
x86-64 architectures on the Polyhedron benchmark