1
Software Bubbles: Using Predication to Compensate
for Aliasing in Software Pipelines
  • Benjamin Goldberg, Emily Crutcher (NYU)
  • Chad Huneycutt, Krishna Palem (Georgia Tech)

2
Introduction
  • New VLIW/EPIC architectures have hardware
    features for supporting a range of compiler
    optimizations
  • e.g., Intel IA-64 (Itanium), HP Labs' HPL-PD
  • Also several processors for embedded systems
  • e.g., the SHARC DSP processor
  • Predication is particularly interesting
  • How can we use predication at run time to enable
    optimizations that the compiler would otherwise
    not be able to perform?
  • This is part of a larger project developing
    run-time tests for optimization and verification.

3
Predication in HPL-PD
  • In HPL-PD, most operations can be predicated
  • they can have an extra operand that is a one-bit
    predicate register.
  • r2 = ADD(r1, r3) if p2
  • If the predicate register contains 0, the
    operation is not performed
  • The values of predicate registers are typically
    set by compare-to-predicate operations
  • p1 = CMPP.lt(r4, r5) (see the C sketch below)
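
As an aside (not from the original slides): a minimal C sketch of what the two predicated operations above compute, assuming the usual nullification semantics that a disabled operation leaves its destination unchanged.

    #include <stdio.h>

    /* Sketch: C analogue of predicated execution (not HPL-PD syntax).
     * The ADD takes effect only when its one-bit predicate is set. */
    int main(void) {
        int r1 = 5, r3 = 7, r2 = 0;
        int r4 = 4, r5 = 9;
        int p2 = (r4 < r5);        /* cf. p1 = CMPP.lt(r4, r5)   */
        if (p2) r2 = r1 + r3;      /* cf. r2 = ADD(r1, r3) if p2 */
        printf("r2 = %d\n", r2);   /* prints 12, since p2 == 1   */
        return 0;
    }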

4
Software Pipelining
  • Software Pipelining is the technique of
    scheduling instructions across several iterations
    of a loop.
  • reduces pipeline stalls on sequential pipelined
    machines
  • exploits instruction level parallelism on
    superscalar and VLIW machines
  • intuitively, iterations are overlaid so that an
    iteration starts before the previous iteration
    has completed

(Figure: sequential loop vs. pipelined loop)
5
Constraints on Software Pipelining
  • The instruction-level parallelism in a software
    pipeline is limited by
  • Resource Constraints
  • VLIW instruction width, functional units, bus
    conflicts, etc.
  • Dependence Constraints
    particularly loop-carried dependences between
    iterations
  • arise when
  • the same register is used across several
    iterations
  • the same memory location is used across several
    iterations (memory aliasing)

6
Aliasing-based Loop Dependences
  • Source code
  • for (i = 2; i < n; i++) a[i] = a[i-3] + c;
  • Assembly: each iteration performs a load of a[i-3], an add, a store
    to a[i], and two address increments
  • the dependence spans three iterations (distance = 3)
  • Pipeline

(Figure: iterations of load, add, store, incr, incr overlapped in a
software pipeline; with distance 3 the initiation interval (II) is
1 cycle, so the kernel is 1 cycle)
7
Aliasing-based Loop Dependences
  • Source code
  • for (i = 2; i < n; i++) a[i] = a[i-1] + c;
  • Assembly: load, add, store, and two address increments per iteration
  • the dependence now has distance 1
  • Pipeline

(Figure: with distance 1, successive iterations cannot overlap the
dependent operations, so the initiation interval (II) is 3 cycles and
the kernel is 3 cycles)
8
Dynamic Memory Aliasing
  • What if the code were
  • for (i = k; i < n; i++) a[i] = a[i-k] + c;
    where k is unknown at compile time?
    (see the C sketch after this slide)
  • the dependence distance is the value of k
  • dynamic aliasing
  • The possibilities are
  • k = 0: no loop-carried dependence
  • k > 0: loop-carried true dependence with
    distance k
  • k < 0: loop-carried anti-dependence with
    distance |k|
  • The worst case is k = 1 (as on the previous slide)
  • The compiler has to assume the worst and
    generate the most pessimistic pipelined schedule
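
A minimal C sketch of the loop above; the function name update is illustrative, and the sketch assumes 0 <= k < n.

    /* Sketch: the dependence distance equals k, which is known only at
     * run time, so the compiler cannot prove how far apart the store to
     * a[i] and the load of a[i-k] are. (Assumes 0 <= k < n.) */
    void update(int *a, int n, int k, int c) {
        for (int i = k; i < n; i++)
            a[i] = a[i - k] + c;   /* k = 0: no loop-carried dependence;
                                      k > 0: true dependence, distance k */
    }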

9
Pipelining Despite Aliasing
  • This situation arises quite frequently:
    void copy(char *a, char *b, int size) {
      for (int i = 0; i < size; i++) a[i] = b[i];
    }
  • Distance = (b - a)
  • What can the compiler do?
  • Generate different versions of the software
    pipeline for different distances
  • branch to the appropriate version at run time
    (see the sketch after this slide)
  • possible code explosion, cost of the branch
  • Another alternative: Software Bubbling
  • a new technique for software pipelining in the
    presence of dynamic aliasing
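
A hedged sketch of the multi-versioning alternative mentioned above; copy_dist1, copy_dist2, and copy_general are hypothetical specializations, not functions from the paper.

    #include <stddef.h>

    /* Hypothetical specializations, one software pipeline per distance.
     * Stub bodies here; the real versions would be scheduled differently. */
    static void copy_dist1(char *a, char *b, int n)   { for (int i = 0; i < n; i++) a[i] = b[i]; }
    static void copy_dist2(char *a, char *b, int n)   { for (int i = 0; i < n; i++) a[i] = b[i]; }
    static void copy_general(char *a, char *b, int n) { for (int i = 0; i < n; i++) a[i] = b[i]; }

    /* Sketch: branch to the version matching the run-time distance.
     * Each extra version adds code size, and the dispatch adds a branch. */
    void copy(char *a, char *b, int size) {
        ptrdiff_t dist = b - a;        /* run-time dependence distance (b - a), as on the slide */
        if (dist == 1)      copy_dist1(a, b, size);
        else if (dist == 2) copy_dist2(a, b, size);
        else                copy_general(a, b, size);
    }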

10
Software Bubbling
  • Compiler generates the most optimistic pipeline
  • constrained only by resource constraints
  • perhaps also by static dependences in the loop
  • All operations in the pipeline kernel are
    predicated
  • rotating predicate registers are especially
    useful, but not necessary
  • The predication pattern determines if the
    operations in a given iteration slot are
    executed
  • The predication pattern is assigned dynamically,
    based on the dependence distance at run time.
  • Continue to use the simple example:
    for (i = k; i < n; i++) a[i] = a[i-k] + c;

11
Software Bubbling
(Figure: the optimistic pipeline used for k > 2 or k < 0, and the
pipelines for k = 1 and k = 2, in which some operations are disabled
by predication)
12
The Predication Pattern
  • Each iteration slot is predicated upon a
    different predicate register
  • all operations within the slot are predicated on
    the same predicate register

(Figure: all operations in one iteration slot (load, add, store, incr,
incr) are predicated on that slot's register p0, p1, p2, p3, ...;
the resulting kernel is: load if p0, add if p1, store if p2,
incr if p3, incr if p4)
13
Bubbling Predication Pattern
(Figure: successive kernel instances of load, add, store, incr, incr
with the rotating predication patterns 10110, 01101, 11011, 10110,
01101, ...; slots whose predicate is 0 are disabled, forming bubbles)
  • The predication pattern in the kernel rotates
  • In this case, the initial pattern is 110110
  • No operation is predicated on the leftmost bit
    in this case
  • Rotating predicate registers are perfect for
    this.

14
Computing the predication pattern
  • Suppose the compiler, based only on static
    constraints, generated an initiation interval of
    II cycles.
  • The number of cycles actually required between
    the source and target iterations of a dynamic
    dependence is latency(sourceOp) + offset(sourceOp, targetOp)
  • The required distance in iteration slots between
    a source and target iteration is given by
    L = ⌈(latency(sourceOp) + offset(sourceOp, targetOp)) / II⌉
  • Note that this can be computed at compile time
    (a small sketch follows this slide)
  • With a dependence distance of d, however, each
    dependence is from an iteration i to an
    iteration i+d. Thus, as long as no more than d
    iterations occur within L iteration slots, the
    dependence is preserved.
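
A small sketch of the compile-time computation of L described above; latency, offset, and II are placeholder parameters standing in for the quantities on the slide.

    /* Sketch: L = ceil((latency(sourceOp) + offset(sourceOp, targetOp)) / II),
     * the number of iteration slots that must separate the source and
     * target of the dependence. Assumes a positive numerator. */
    int required_slots(int latency, int offset, int II) {
        return (latency + offset + II - 1) / II;   /* integer ceiling */
    }
    /* The dependence of run-time distance d is then preserved as long
     * as at most d iterations are enabled in any window of L slots. */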

15
Computing the predication pattern
  • Example
  • Suppose d = 2 and latency(store) = 1
  • L = ⌈(latency(store) + offset(store, load)) / II⌉
  • = 3, the factor by which the II would have to
    be increased if the dependence spanned only one
    iteration
  • The predication pattern should ensure that only d
    out of L iteration slots are enabled, in this
    case 2 out of 3 slots

(Figure: pipeline diagram with offset(store, load) = -2 marked within
the static initiation interval (II))
16
Computing the Predication Pattern (cont)
  • To enable d out of L iteration slots, we simply
    create a bit pattern of length L whose first d
    bits are 1 and the rest are 0
  • i.e., the value 2^d - 1
  • Before entering the loop, we initialize the
    aggregate predicate register (PR) by executing
  • PR = shl(1, rd)
  • PR = sub(PR, 1)
  • where rd contains the value of d (a run-time
    value)
  • The predicate register rotation occurs
    automatically using BRF, plus the instruction
  • p0 = mov(pL)
  • within the loop, where L is a compile-time
    constant (a sketch of the rotating pattern
    follows this slide)
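
A minimal sketch, in plain C rather than HPL-PD rotating predicate registers, of the pattern initialization and per-kernel rotation described above; d = 2 and L = 3 are example values.

    #include <stdio.h>

    /* Sketch: PR = (1 << d) - 1 enables d of every L iteration slots,
     * and the pattern rotates by one bit per kernel iteration
     * (the p0 = mov(pL) effect). */
    int main(void) {
        int d = 2, L = 3;                 /* example run-time distance, window */
        unsigned pr = (1u << d) - 1;      /* PR = shl(1, rd); PR = sub(PR, 1)  */
        for (int slot = 0; slot < 9; slot++) {
            printf("slot %d: %s\n", slot, (pr & 1) ? "run" : "bubble");
            /* rotate the low L bits of the pattern right by one position */
            pr = ((pr >> 1) | (pr << (L - 1))) & ((1u << L) - 1);
        }
        return 0;   /* prints run, run, bubble, run, run, bubble, ... */
    }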

17
Generalized Software Bubbling
  • So far, we've seen Simple Bubbling
  • d is constant throughout the loop
  • If d changes as the loop progresses, software
    bubbling can still be performed
  • The predication pattern changes as well
  • This is called Generalized Bubbling
  • the test occurs within the loop
  • an iteration slot is enabled only if fewer than d
    of the previous L iteration slots have been
    enabled (see the sketch after this slide)
  • Examples of code requiring generalized bubbling
    appear quite often
  • Alvinn SPEC benchmark, Lawrence Livermore Loops
    code
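
A hedged sketch of the per-slot run-time test for generalized bubbling; enable_slot and history are illustrative names, not from the paper.

    /* Sketch: enable an iteration slot only if fewer than d of the
     * previous L slots were enabled. history[j] is 1 if slot j ran. */
    int enable_slot(int d, int L, const int *history, int slot) {
        int recently_enabled = 0;
        for (int j = 1; j <= L && slot - j >= 0; j++)
            recently_enabled += history[slot - j];
        return recently_enabled < d;    /* 1 = run this slot, 0 = bubble */
    }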

18
Bubbling vs. Hardware Disambiguation
  • EPIC architectures provide hardware support for
    memory disambiguation
  • IA-64: Advanced Load Address Table (ALAT), ld.a, ld.c
  • HPL-PD: lds, ldv
  • Allows loads to be moved above possibly aliased
    stores
  • ldv (or ld.c) will reissue the load if aliasing
    occurred
  • Can't this be used for software pipelining?

19
Bubbling vs. Hardware Disambiguation
  • The problem with using hardware disambiguation is
    that the ldv/ld.c must occur after the store.

(Figure: a possible dependence between the store and the speculative load)
20
Experimental Results
  • Experiments were performed using the Trimaran
    Compiler Research Infrastructure
  • www.trimaran.org
  • produced by a consortium of HP Labs, UIUC, NYU,
    and Georgia Tech
  • Provides a highly optimizing EPIC compiler
  • Configurable HPL-PD cycle-by-cycle simulator
  • Visualization tools for displaying IR,
    performance, etc.
  • Benchmarks from the literature were identified as
    being amenable to software bubbling

21
Simple Bubbled Loops
Callahan-Dongarra-Levine S152 Loop Benchmark
Matrix Addition
22
Generalized Bubbled Loops
Alvinn SPEC Benchmark
23
Generalized Bubbled Loops (cont)
Lawrence Livermore Loops Kernel 2 Benchmark
24
Related Work
  • Nicolau (89): Memory disambiguation for Multiflow
  • Bernstein, Cohen, Maydan (94): Run-time tests for
    array aliasing
  • Su, Habib, Zhao, Wang, Wu (94): Empirical study of
    dynamic memory aliasing, run-time checks in
    pipelined code
  • Davidson, Jinturkar (95): Run-time disambiguation
    for unrolling and pipelining
  • Warter, Lavery, Hwu (93): Predication for
    if-conversion in software pipelines
  • Rau, Schlansker, Tirumalai (92): Predication for the
    prolog and epilog of software pipelines

25
Conclusions
  • Modern VLIW/EPIC architectures provide ample
    opportunity, and need, for sophisticated
    optimizations
  • Predication is a very powerful feature of these
    machines
  • Dynamic memory aliasing doesn't have to prevent
    optimizations like software pipelining
  • we've also applied similar techniques to scalar
    replacement, loop interchange, etc.