Software Bubbles: Using Predication to Compensate for Aliasing in Software Pipelines presentation

1
Software Bubbles: Using Predication to Compensate
for Aliasing in Software Pipelines

Benjamin Goldberg, Emily Crutcher (NYU)
Chad Huneycutt, Krishna Palem (Georgia Tech)

2
Introduction

New VLIW/EPIC architectures have hardware
features that support a range of compiler
optimizations.
Intel IA-64 (Itanium), HP Labs HPL-PD
Also several processors for embedded systems,
e.g. the SHARC DSP processor
Predication is particularly interesting.
How can we use predication at run time to enable
optimizations that the compiler would otherwise
not be able to perform?
This is part of a larger project developing
run-time tests for optimization and verification.

3
Predication in HPL-PD

In HPL-PD, most operations can be predicated:
they can have an extra operand that is a one-bit
predicate register.
r2 = ADD r1, r3 if p2
If the predicate register contains 0, the
operation is not performed.
The values of predicate registers are typically
set by compare-to-predicate operations.
p1 = CMPP< r4, r5
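
The behaviour of a predicated operation can be modelled in a few lines of C. This is only a sketch of the semantics described on this slide, not HPL-PD syntax: the function names predicated_add and cmpp_lt are hypothetical, and the comparison is assumed to be a plain "less than".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of "r2 = ADD r1, r3 if p2": the destination keeps
 * its old value unless the guarding predicate is 1. */
int32_t predicated_add(int32_t dest, int32_t a, int32_t b, bool p)
{
    return p ? a + b : dest;   /* predicate 0: operation not performed */
}

/* Hypothetical model of a compare-to-predicate operation, assuming the
 * comparison is "less than". The result would be written to a one-bit
 * predicate register. */
bool cmpp_lt(int32_t a, int32_t b)
{
    return a < b;
}
```

For example, predicated_add(7, 2, 3, cmpp_lt(4, 5)) performs the add, while predicated_add(7, 2, 3, cmpp_lt(5, 4)) leaves the destination value 7 unchanged.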

4
Software Pipelining

Software Pipelining is the technique of
scheduling instructions across several iterations
of a loop.
reduces pipeline stalls on sequential pipelined
machines
exploits instruction level parallelism on
superscalar and VLIW machines
intuitively, iterations are overlaid so that an
iteration starts before the previous iteration
has completed

[Figure: sequential loop compared with pipelined loop]
5
Constraints on Software Pipelining

The instruction-level parallelism in a software
pipeline is limited by
Resource Constraints
VLIW instruction width, functional units, bus
conflicts, etc.
Dependence Constraints
particularly loop-carried dependences between
iterations
arise when
the same register is used across several
iterations
the same memory location is used across several
iterations (memory aliasing)
6
Aliasing-based Loop Dependences

Source code
for (i = 2; i < n; i++) a[i] = a[i-3] + c;

Assembly
load, add, store, incr a3, incr a

[Figure: pipeline of overlapped iterations, each consisting of load, add, store, incr a3, incr a. The dependence spans three iterations (distance = 3). The initiation interval (II) is marked; the kernel takes 1 cycle.]
7
Aliasing-based Loop Dependences

Source code
for (i = 2; i < n; i++) a[i] = a[i-1] + c;

Assembly
load, add, store, incr a1, incr a

[Figure: pipeline of overlapped iterations, each consisting of load, add, store, incr a1, incr a. The dependence distance is 1. The initiation interval (II) is marked; the kernel takes 3 cycles.]
8
Dynamic Memory Aliasing

What if the code were
for (i = k; i < n; i++) a[i] = a[i-k] + c;
where k is unknown at compile time?
The dependence distance is the value of k:
dynamic aliasing.
The possibilities are
k = 0: no loop-carried dependence
k > 0: loop-carried true dependence with
distance k
k < 0: loop-carried anti-dependence with
distance |k|
The worst case is k = 1 (as on the previous slide).
The compiler has to assume the worst and
generate the most pessimistic pipelined schedule.
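
The three cases for k can be written down as a small run-time classification. This is a minimal sketch; the type dep_kind and the function names are hypothetical and only restate the case analysis on this slide.

```c
/* Hypothetical labels for the three cases listed on the slide. */
typedef enum { DEP_NONE, DEP_TRUE, DEP_ANTI } dep_kind;

/* Classify the loop-carried dependence of a[i] = a[i-k] + c. */
dep_kind classify_dependence(long k)
{
    if (k > 0) return DEP_TRUE;   /* a later iteration reads what an earlier one wrote */
    if (k < 0) return DEP_ANTI;   /* a later iteration overwrites what an earlier one read */
    return DEP_NONE;              /* k == 0: each iteration touches only its own element */
}

/* Dependence distance in iterations, taken here as the magnitude of k. */
long dependence_distance(long k)
{
    return k < 0 ? -k : k;
}
```

A compiler cannot evaluate classify_dependence at compile time when k is a run-time value, which is why it must otherwise schedule for k = 1.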

9
Pipelining Despite Aliasing

This situation arises quite frequently:
void copy(char *a, char *b, int size) {
  for (int i = 0; i < size; i++) a[i] = b[i];
}
Distance = (b - a)
What can the compiler do?
Generate different versions of the software
pipeline for different distances and
branch to the appropriate version at run time
(possible code explosion, cost of branch).
Another alternative: Software Bubbling,
a new technique for software pipelining in the
presence of dynamic aliasing.

10
Software Bubbling

Compiler generates the most optimistic pipeline
constrained only by resource constraints
perhaps also by static dependences in the loop
All operations in the pipeline kernel are
predicated
rotating predicate registers are especially
useful, but not necessary
The predication pattern determines if the
operations in a given iteration slot are
executed
The predication pattern is assigned dynamically,
based on the dependence distance at run time.
We continue to use the simple example
for (i = k; i < n; i++) a[i] = a[i-k] + c;

11
Software Bubbling
[Figure: three pipelines. Optimistic pipeline for k > 2 or k < 0; pipeline for k = 1; pipeline for k = 2. The operations disabled by predication are marked.]
12
The Predication Pattern

Each iteration slot is predicated upon a
different predicate register
all operations within the slot are predicated on
the same predicate register

load if p0   add if p0   store if p0   incr if p0   incr if p0
load if p1   add if p1   store if p1   incr if p1   incr if p1
load if p2   add if p2   store if p2   incr if p2   incr if p2
load if p3   add if p3   store if p3   incr if p3   incr if p3

kernel:
load if p0   add if p1   store if p2   incr if p3   incr if p4
13
Bubbling Predication Pattern
[Figure: overlapped iterations of load, add, store, incr, incr, annotated with the successive kernel predication patterns 10110, 01101, 11011, 10110, 01101.]

The predication pattern in the kernel rotates
In this case, the initial pattern is 110110
No operation is predicated on the leftmost bit
in this case
Rotating predicate registers are perfect for
this.
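
The rotating patterns listed on this slide can be reproduced by sliding a five-slot window over a repeating stream of slot-enable bits. The sketch below assumes two enabled slots in every group of three, as the initial pattern 110110 suggests; slot_enabled and kernel_window are hypothetical names.

```c
/* Slot t is enabled when it falls in the first `enabled` positions of
 * each group of `period` consecutive slots (assumed layout). */
int slot_enabled(unsigned t, unsigned enabled, unsigned period)
{
    return (t % period) < enabled;
}

/* Pack the enable bits of slots start .. start+width-1 into an integer,
 * with the oldest slot in the most significant of the `width` bits. */
unsigned kernel_window(unsigned start, unsigned width,
                       unsigned enabled, unsigned period)
{
    unsigned bits = 0;
    for (unsigned i = 0; i < width; i++)
        bits = (bits << 1) | (unsigned)slot_enabled(start + i, enabled, period);
    return bits;
}
```

With enabled = 2 and period = 3, kernel_window(0, 6, 2, 3) is binary 110110, and the five-bit windows starting at slots 1, 2 and 3 are binary 10110, 01101 and 11011, the sequence shown on the slide.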

14
Computing the predication pattern

Suppose the compiler, based only on static
constraints, generated an initiation interval of
II cycles.
The number of cycles actually required between
the source and target iterations of a dynamic
dependence is
$\mathrm{latency}(\mathit{sourceOp}) - \mathrm{offset}(\mathit{sourceOp}, \mathit{targetOp})$.
The required distance in iteration slots between
a source and target iteration is given by
$L = \lceil (\mathrm{latency}(\mathit{sourceOp}) - \mathrm{offset}(\mathit{sourceOp}, \mathit{targetOp})) / II \rceil$.
Note that this can be computed at compile time.
With a dependence distance of d, however, each
dependence is from an iteration i to an
iteration i+d. Thus, as long as no more than d
iterations occur within L iteration slots, the
dependence is preserved.

15
Computing the predication pattern

Example
Suppose d = 2 and latency(store) = 1.
$L = \lceil (\mathrm{latency}(\mathit{store}) - \mathrm{offset}(\mathit{store}, \mathit{load})) / II \rceil = 3$,
the factor by which the II would have to
be increased, assuming the dependence spanned one
iteration.
The predication pattern should ensure that only d
out of L iteration slots are enabled. In this
case, 2 out of 3 slots.

[Figure: pipeline diagram marking offset = -2 and the static interval (II).]
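
The slot distance L is a compile-time ceiling division. The sketch below rests on two assumptions that the transcript does not state outright: the cycle count is taken as latency minus offset (the reading under which latency 1 and offset -2 give L = 3), and II = 1 for the running example. The function names are hypothetical.

```c
#include <assert.h>

/* L = ceil((latency - offset) / ii), under the assumption that the
 * required cycle count is latency minus offset. */
int required_slot_distance(int latency, int offset, int ii)
{
    assert(ii > 0);
    int cycles = latency - offset;
    if (cycles <= 0) return 0;          /* no separation needed */
    return (cycles + ii - 1) / ii;      /* integer ceiling division */
}

/* Bubbles are only needed for a positive run-time distance d that is
 * smaller than L; otherwise every slot can stay enabled. */
int needs_bubbles(int d, int L)
{
    return d > 0 && d < L;
}
```

With latency(store) = 1, offset(store, load) = -2 and an assumed II of 1, required_slot_distance returns 3, and a distance of d = 2 needs bubbles while d = 3 does not.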
16
Computing the Predication Pattern (cont)

To enable d out of L iteration slots, we simply
create a bit pattern of length L whose first d
bits are 1 and the rest are 0:
$2^d - 1$.
Before entering the loop, we initialize the
aggregate predicate register (PR) by executing
PR = shl 1, rd
PR = sub PR, 1
where rd contains the value of d (run-time
value).
The predicate register rotation occurs
automatically using BRF and adding the
instruction
p0 = mov pL
within the loop, where L is a compile-time
constant.
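
The effect of the rotating pattern can be checked with a toy simulation of the running example a[i] = a[i-k] + c. This is a sketch under stated assumptions, not the authors' implementation: II = 1, an iteration issued in cycle s loads in s, adds in s+1 and stores in s+2, a store becomes visible to loads one cycle later (so L = 3), and rotation is modelled as an L-bit rotate. All function names and MAX_N are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MAX_N = 64 };

/* Low d bits set, mirroring "PR = shl 1, rd; PR = sub PR, 1".
 * d is clamped to L so that d >= L enables every slot. */
uint32_t initial_pattern(unsigned d, unsigned L)
{
    assert(d >= 1 && L >= 1 && L < 32);
    if (d > L) d = L;
    return (1u << d) - 1u;
}

/* Rotate an L-bit pattern by one position, wrapping the top bit back to
 * bit 0 (a stand-in for register rotation plus "p0 = mov pL"). */
uint32_t rotate_pattern(uint32_t pr, unsigned L)
{
    uint32_t top = (pr >> (L - 1)) & 1u;
    return ((pr << 1) | top) & ((1u << L) - 1u);
}

/* Reference semantics: a[i] = a[i-k] + c for i in [k, n). */
void run_sequential(int *a, int n, int k, int c)
{
    for (int i = k; i < n; i++) a[i] = a[i - k] + c;
}

/* Toy pipeline: a slot issues a new iteration only if bit 0 of the
 * pattern is 1; otherwise the slot is a bubble. */
void run_bubbled(int *a, int n, int k, int c, unsigned d, unsigned L)
{
    uint32_t pr = initial_pattern(d, L);
    int next = k;                 /* next iteration to issue */
    int ld_i = -1, ld_v = 0;      /* iteration and value that loaded last cycle */
    int ad_i = -1, ad_v = 0;      /* iteration and value that added last cycle */

    while (next < n || ld_i >= 0 || ad_i >= 0) {
        int new_i = -1, new_v = 0, sum = 0;
        if ((pr & 1u) && next < n) new_i = next++;   /* slot enabled? */
        pr = rotate_pattern(pr, L);

        if (new_i >= 0) new_v = a[new_i - k];        /* load stage */
        if (ld_i >= 0)  sum = ld_v + c;              /* add stage */
        if (ad_i >= 0)  a[ad_i] = ad_v;              /* store stage, after this cycle's load */

        ad_i = ld_i;  ad_v = sum;                    /* advance the pipeline */
        ld_i = new_i; ld_v = new_v;
    }
}

/* Returns 1 when the bubbled run leaves memory identical to the
 * sequential run on the same initial data. */
int bubbled_matches_sequential(int n, int k, int c, unsigned d, unsigned L)
{
    int ref[MAX_N], got[MAX_N];
    assert(n <= MAX_N && k >= 1);
    for (int i = 0; i < n; i++) ref[i] = got[i] = i * 7;
    run_sequential(ref, n, k, c);
    run_bubbled(got, n, k, c, d, L);
    return memcmp(ref, got, (size_t)n * sizeof ref[0]) == 0;
}
```

In this model, choosing d equal to the run-time distance k reproduces the sequential result for k = 1, 2 and larger values, whereas running k = 1 with every slot enabled (d = L) reads stale values and diverges from the sequential loop.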

17
Generalized Software Bubbling

So far, we've seen Simple Bubbling:
d is constant throughout the loop.
If d changes as the loop progresses, then
software bubbling can still be performed.
The predication pattern changes as well.
This is called Generalized Bubbling:
the test occurs within the loop;
an iteration slot is only enabled if fewer than d
iteration slots out of the previous L slots
have been enabled.
Examples of code requiring generalized bubbling
appear quite often:
Alvinn SPEC benchmark, Lawrence Livermore Loops
code.
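
The in-loop test can be sketched as a sliding-window count over recent slots. This is one possible reading of the rule, not the authors' code: the number of preceding slots examined is left as a parameter (window), because examining L - 1 preceding slots with a constant d reproduces the d-out-of-L pattern of simple bubbling. The per-slot distances d_per_slot and the function names are hypothetical.

```c
#include <stdint.h>

/* Number of 1 bits among the low `bits` bits of x. */
unsigned count_low_bits(uint32_t x, unsigned bits)
{
    unsigned n = 0;
    for (unsigned i = 0; i < bits && i < 32; i++) n += (x >> i) & 1u;
    return n;
}

/* Decide slots one at a time (at most 32). Bit t of the result is 1
 * when slot t is enabled. d_per_slot[t] is the dependence distance
 * assumed to be known when slot t is about to issue. */
uint32_t generalized_enables(const unsigned *d_per_slot, unsigned count,
                             unsigned window)
{
    uint32_t history = 0;   /* bit j: was the slot j+1 positions back enabled? */
    uint32_t result = 0;
    for (unsigned t = 0; t < count && t < 32; t++) {
        unsigned recent = count_low_bits(history, window);
        uint32_t en = recent < d_per_slot[t];   /* fewer than d recently enabled */
        result |= en << t;
        history = (history << 1) | en;
    }
    return result;
}
```

With a constant distance of 2 and a window of 2 preceding slots (L = 3), the first six slots come out as enabled, enabled, bubble, enabled, enabled, bubble, matching the simple-bubbling pattern; when the distance changes mid-loop, the enable decisions change with it.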

18
Bubbling vs. Hardware Disambiguation

EPIC architectures provide hardware support for
memory disambiguation
IA-64: Advanced Load Address Table, ld.a, ld.c
HPL-PD: lds, ldv
Allows loads to be moved above possibly aliased
stores.
ldv (or ld.c) will reissue the load if aliasing
occurs.
Can't this be used for software pipelining?

19
Bubbling vs. Hardware Disambiguation

The problem with using hardware disambiguation is
that the ldv/ld.c must occur after the store.

[Figure: pipeline diagram marking a possible dependence.]
20
Experimental Results

Experiments were performed using the Trimaran
Compiler Research Infrastructure
www.trimaran.org
produced by a consortium of HP Labs, UIUC, NYU,
and Georgia Tech
Provides a highly optimizing EPIC compiler
Configurable HPL-PD cycle-by-cycle simulator
Visualization tools for displaying IR,
performance, etc.
Benchmarks from the literature were identified as
being amenable to software bubbling.

21
Simple Bubbled Loops
[Chart: Callahan-Dongarra-Levine S152 loop benchmark]
[Chart: matrix addition]
22
Generalized Bubbled Loops
Alvinn SPEC Benchmark
23
Generalized Bubbled Loops (cont)
Lawrence Livermore Loops Kernel 2 Benchmark
24
Related Work

Nicolau (89): Memory disambiguation for Multiflow
Bernstein, Cohen, Maydan (94): Run-time tests for
array aliasing
Su, Habib, Zhao, Wang, Wu (94): Empirical study of
dynamic memory aliasing, run-time checks in
pipelined code
Davidson, Jinturkar (95): Run-time disambiguation
for unrolling and pipelining
Warter, Lavery, Hwu (93): Predication for
if-conversion in software pipelines
Rau, Schlansker, Tirumalai (92): Predication for the
prolog and epilog of software pipelines

25
Conclusions

Modern VLIW/EPIC architectures provide ample
opportunity, and need, for sophisticated
optimizations
Predication is a very powerful feature of these
machines
Dynamic memory aliasing doesn't have to prevent
optimizations like software pipelining.
We've also applied similar techniques to scalar
replacement, loop interchange, etc.
