Qcc Loop Optimization Presentation - PowerPoint PPT Presentation

Slides: 15
Provided by: alex135

1
Development of an efficient DSP Compiler based on Open64
Subrato K. De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum,
Susan Brownhill, Sergei Larin, Taylor Simpson
Qualcomm Incorporated, San Diego and Austin, USA
Email: {sde, adasgupt, sundeepk, tlinth, yzhu, slarin, tsimpson}@qualcomm.com
2
Introduction
  • DSP-specific extensions to the C/C++ language and
    support in Open64 for efficient usage of DSP
    hardware features.
  • Enhancement of WOPT with impact on DSPs for
    mobile devices.
  • Modification of the hyperblock scheduler for DSP
    with limited predication.
  • Register pair allocation.
  • Results at a glance.

3
DSP-specific extensions to C/C++: use of circular
load/store intrinsics
[Figure: memory map with columns "Hex address" and "Word data";
addresses 0x0F00, 0x0F04, 0x0F08, 0x0F0C, 0x0F10, 0x0F14]
Without intrinsics:

    int *coeffPointer;
    for (i = 0; i < noOfInputSamples; i++) {
        sum = 0;
        coeffPointer = &coeff[0];
        for (j = 0; j < 4; j++)
            sum += inputSample[i+j] * (*coeffPointer++);
        outputSample[i] = sum;
    }

With the circular load intrinsic:

    int *coeffPointer;
    int value;
    for (i = 0; i < noOfInputSamples; i++) {
        sum = 0;
        for (j = 0; j < 4; j++) {
            LOAD_CIRC_INT(value, coeffPointer, 1, 4, &coeff[0]);
            sum += inputSample[i+j] * value;
        }
        outputSample[i] = sum;
    }
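As a rough illustration of what the circular load buys, the sketch below models the wrap-around in plain C with a modulo index instead of the DSP intrinsic. The function name `fir4_wrapped` and the constant `NUM_TAPS` are hypothetical and not from the deck; the input is assumed to hold at least n + 3 samples.

```c
#include <assert.h>

#define NUM_TAPS 4   /* hypothetical name; the deck hard-codes 4 */

/* 4-tap FIR where the coefficient index wraps around by itself, so it is
 * set once before the loops and never reset (the role coeffPointer plays
 * in the intrinsic version). Assumes `in` holds at least n + 3 samples. */
static void fir4_wrapped(const int *in, int *out, int n,
                         const int coeff[NUM_TAPS])
{
    assert(n >= 0);
    int k = 0;                          /* circular coefficient index */
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = 0; j < NUM_TAPS; j++) {
            sum += in[i + j] * coeff[k];
            k = (k + 1) % NUM_TAPS;     /* software stand-in for the circular update */
        }
        out[i] = sum;
    }
}
```

On a target with circular addressing, the modulo step would presumably be folded into the load itself, which is the point of the intrinsic.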
4
DSP-specific extensions to C/C++: use of
bit-reversed load/store intrinsics
    unsigned int ReverseBits(unsigned int index, unsigned int NumBits)
    {
        unsigned int i, rev;
        for (i = rev = 0; i < NumBits; i++) {
            rev = (rev << 1) | (index & 1);
            index >>= 1;
        }
        return rev;
    }
    Standard bit order    Bit-reversed order
    000                   000
    001                   100
    010                   010
    011                   110
    100                   001
    101                   101
    110                   011
    111                   111
Without intrinsics:

    for (i = 0; i < NumSamples; i++) {
        j = ReverseBits(i, NumBits);
        RealOut[j] = RealIn[i];
        ImagOut[j] = ImagIn[i];
    }

With the bit-reversed store intrinsic:

    int *realOutPointer;
    int *imagOutPointer;
    for (i = 0; i < NumSamples; i++) {
        STORE_BREV_INT(RealIn[i], realOutPointer, 1, NumSamples, &RealOut[0]);
        STORE_BREV_INT(ImagIn[i], imagOutPointer, 1, NumSamples, &ImagOut[0]);
    }
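The bit-reversed store avoids calling ReverseBits for every sample. One way to picture the address update in software is a reverse-carry increment, sketched below. This is only a host-side model under the assumption that the buffer size is a power of two; the names `brev_next` and `brev_copy` are hypothetical and say nothing about how the hardware or the compiler actually implements STORE_BREV_INT.

```c
#include <assert.h>

/* Reverse-carry increment: given the bit-reversed form of index i over a
 * buffer of `size` entries (a power of two), return the bit-reversed form
 * of i + 1. The carry propagates from the top bit downward. */
static unsigned brev_next(unsigned rev, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    unsigned bit = size >> 1;
    while (bit != 0 && (rev & bit)) {
        rev ^= bit;        /* clear the set bit and carry into the next lower bit */
        bit >>= 1;
    }
    return rev | bit;      /* wraps to 0 after the last index */
}

/* Copy `in` to `out` in bit-reversed order without recomputing the
 * reversal from scratch for each element. */
static void brev_copy(const int *in, int *out, unsigned size)
{
    unsigned j = 0;
    for (unsigned i = 0; i < size; i++) {
        out[j] = in[i];
        j = brev_next(j, size);
    }
}
```

For size 8 the sequence of `j` values is 0, 4, 2, 6, 1, 5, 3, 7, matching the bit-reversed column of the table above.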
5
Design and implementation of circular and
bit-reverse load/store intrinsics
  • Macro expansions:

    #define LOAD_CIRC_INT(v, p, s, l, a) \
        ( (v) = *(p), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )

    #define STORE_CIRC_INT(v, p, s, l, a) \
        ( *(p) = (v), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )

  • The WHIRL IR:

    <result> = load_indirect(pointer)  OR  store_indirect(pointer) = <source>
    <pointer> = circ_update(pointer, step, CR)

  • Here CR is the configuration register that
    defines the circular/bit-reverse buffer based on
    buffer size and the start address.
  • The load/store and the circular/bit-reverse update
    are combined in the CG phase:

    <result, pointer> = load_circular_update(pointer, step, CR), OR
    <pointer> = store_circular_update(source, pointer, step, CR)

  • CRs are treated as dedicated temporary names
    (TNs).
  • Algorithm for hoisting the loop-invariant
    CR assignment statements, and
  • Algorithm for CR allocation when the target
    supports multiple CRs.
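To experiment with the macro expansions off-target, one option is a portable stand-in for `circ_update`, sketched below. The deck does not give the intrinsic's prototype or say whether step and length are counted in elements or bytes; this sketch assumes `int` elements and a buffer that starts at the given address, so treat the function body as a hypothetical model rather than the real intrinsic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of circ_update: advance p by `step` int
 * elements and wrap it into [base, base + len). The real intrinsic maps
 * to DSP circular addressing; units and prototype here are assumptions. */
static void *circ_update(void *p, int step, int len, void *base)
{
    assert(len > 0);
    ptrdiff_t idx = (((int *)p - (int *)base) + step) % len;
    if (idx < 0)               /* C's % can yield a negative remainder */
        idx += len;
    return (int *)base + idx;
}

/* Macro expansions as reconstructed from the slide. */
#define LOAD_CIRC_INT(v, p, s, l, a) \
    ( (v) = *(p), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )

#define STORE_CIRC_INT(v, p, s, l, a) \
    ( *(p) = (v), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )
```

Storing the values 1 through 6 into a 4-entry buffer with `STORE_CIRC_INT` overwrites the first two slots, leaving 5, 6, 3, 4 and the pointer at the third slot.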

6
DSP-specific extensions to C/C++: use and
optimizations through loop pragmas
  • Loop pragmas:
  • #pragma LOOP_TRIP_COUNT_MIN(min)
  • #pragma LOOP_TRIP_COUNT_MAX(max)
  • #pragma LOOP_TRIP_COUNT_MODULO(modulo)
  • #pragma LOOP_UNROLL(N)
  • With trip-count information, the loop guard
    condition can be removed.
  • The loop trip count computation can be simplified
    (e.g., divide replaced by right shifts) or
    replaced with a constant count value.
  • The alternate loop during software pipelining may
    not be generated.
  • Remainder loops during unrolling (LNO, CG) may
    not be needed.
  • Unrolling: the standard benefits.
  • Less loop overhead, because fewer
    branch/jump/comparison operations are executed.
  • Scheduling can be better due to the increased
    number of operations in the loop body.
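The effect of the trip-count pragmas can be pictured by hand. The sketch below contrasts a plain loop with a 4x unrolled version written under the assumptions that a minimum trip count of 1 and a trip-count modulo of 4 would convey. The function names are hypothetical, the pragma is shown only as a comment so the code builds with any C compiler, and the unrolled form is an illustration, not the compiler's actual output.

```c
#include <assert.h>

/* Source form. With the deck's extensions the loop would be annotated,
 * for example with:  #pragma LOOP_TRIP_COUNT_MODULO(4)  */
static int sum_rolled(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Hand-written picture of 4x unrolling when n > 0 and n % 4 == 0 are
 * known: no guard before the loop and no remainder loop after it. */
static int sum_unrolled4(const int *a, int n)
{
    assert(n > 0 && n % 4 == 0);
    int sum = 0;
    int i = 0;
    do {
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        i += 4;
    } while (i < n);
    return sum;
}
```

Without the trip-count facts, the unrolled version would need an `if (n > 0)` guard plus a cleanup loop for the leftover 0 to 3 iterations.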

7
Register promotion of small structs (structures
and unions <= 8 bytes that could fit in a register)

The WHIRL IR using distinct auxiliary symbols for each field in the
small struct, e.g., h[1] -> st 9, w[0] -> st 8:

    typedef union { long long d; int w[2]; short h[4]; char b[8]; } PAIR;
    PAIR var, Pi, Po;
    var.h[1] = 3;
    x = var.w[0];

     I4INTCONST 3 (0x3)
    I2STID 2 <st 9>
     I4I4LDID 0 <st 8>
    I4STID 0 <st 5>

The EXTRACT_BITS / COMPOSE_BITS transformation replaces the auxiliary
symbols for each field of the small struct by the unique full-sized
auxiliary symbol representing the full structure type, e.g., PAIR -> st 4:

       I8I8LDID 0 <st 4>
       I4INTCONST 3 (0x3)
      I8CVTL 16
     I8COMPOSE_BITS <bofst:16 bsize:16>
    I8STID 0 <st 4>
      I8I8LDID 0 <st 4>
     I8EXTRACT_BITS <bofst:0 bsize:32>
    I4STID 0 <st 5>
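In C terms, the two operators amount to masked insert and extract on a single 64-bit value that can live in a register pair. The helpers below are a hedged sketch: the names `compose_bits`, `extract_bits`, and `mask_bits` are hypothetical, bit offsets are assumed to count from the least significant bit (consistent with h[1] sitting at bofst 16 on a little-endian layout), and sign extension of extracted signed fields is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `bsize` bits set (1 <= bsize <= 64). */
static uint64_t mask_bits(unsigned bsize)
{
    return bsize >= 64 ? ~UINT64_C(0) : ((UINT64_C(1) << bsize) - 1);
}

/* Model of COMPOSE_BITS: overwrite `bsize` bits of `word`, starting at
 * bit offset `bofst`, with the low bits of `field`. */
static uint64_t compose_bits(uint64_t word, uint64_t field,
                             unsigned bofst, unsigned bsize)
{
    assert(bsize >= 1 && bofst + bsize <= 64);
    uint64_t mask = mask_bits(bsize) << bofst;
    return (word & ~mask) | ((field << bofst) & mask);
}

/* Model of EXTRACT_BITS (unsigned variant): read `bsize` bits of `word`
 * starting at bit offset `bofst`. */
static uint64_t extract_bits(uint64_t word, unsigned bofst, unsigned bsize)
{
    assert(bsize >= 1 && bofst + bsize <= 64);
    return (word >> bofst) & mask_bits(bsize);
}
```

Under these assumptions, `var.h[1] = 3` corresponds to `compose_bits(var, 3, 16, 16)` and `x = var.w[0]` to `extract_bits(var, 0, 32)`, with no memory traffic for the individual fields.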
8
Hyperblock Formation
  • Hyperblock scheduling can be important on DSPs
  • Many DSP architectures support predicated
    instructions
  • Especially profitable on control code
  • Restricted form of predication on DSP targets
  • Not every instruction is predicatable
  • Architecture decisions mainly due to resource and
    power savings
  • The default hyperblock algorithm rejected too many
    basic blocks because of instructions that aren't
    predicatable on the DSP
  • Most potential hyperblocks on our target
    contained a few basic blocks
  • Small hyperblocks exacerbated the problem
  • Needed a more permissive version of hyperblock
    formation

9
Changes to Hyperblock Formation Algorithm
  • Allowed a basic block (e.g. B3) to contain
    non-predicatable instructions if it
    post-dominates the entry block (e.g. B1) of the
    hyperblock

[Figure: control-flow graph of blocks B1 to B5]
B1: hyperblock entry block.
B3: block contains non-predicatable instructions.
B2, B4, B5: large, completely predicatable basic blocks.
  • Default hyperblock algorithm would reject blocks
    B1-B5
  • Successfully formed hyperblock after our
    modifications
  • Modified algorithm led to more basic blocks
    considered and more hyperblocks formed. Details
    in paper.
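The relaxed rule can be read as a small change to the per-block admission test. A block that post-dominates the hyperblock entry executes on every path through the hyperblock, so its instructions do not need a guarding predicate. The sketch below assumes the two facts per block have already been computed; the struct and function names are hypothetical, and the actual algorithm in the paper may impose further conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-block summary, assumed to be precomputed. */
typedef struct {
    bool all_predicatable;     /* every instruction in the block can be predicated */
    bool postdominates_entry;  /* block post-dominates the hyperblock entry */
} BlockInfo;

/* Default rule: any non-predicatable instruction disqualifies the block. */
static bool admissible_default(const BlockInfo *b)
{
    return b->all_predicatable;
}

/* Relaxed rule from the slide: non-predicatable instructions are tolerated
 * when the block post-dominates the entry block. */
static bool admissible_relaxed(const BlockInfo *b)
{
    return b->all_predicatable || b->postdominates_entry;
}

/* Count how many of n candidate blocks pass the chosen rule. */
static int count_admissible(const BlockInfo *blocks, int n, bool relaxed)
{
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i < n; i++) {
        bool ok = relaxed ? admissible_relaxed(&blocks[i])
                          : admissible_default(&blocks[i]);
        if (ok)
            count++;
    }
    return count;
}
```

For a region shaped like the slide's example, where only B3 holds non-predicatable instructions and B3 post-dominates B1, the default rule admits four blocks and the relaxed rule admits all five.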

10
Efficient Register Pair Allocation
  • Many DSPs support 64-bit registers by combining
    two adjacent 32-bit registers.
  • Two choices:
  • Choice 1: use a single 8-byte TN.
  • Handle all complexities in the register
    allocator.
  • Direct mapping of data types from WHIRL to CGIR.
  • Simple semantics: 1 result, 2 or more operands.
  • Choice 2: use two 4-byte TNs.
  • Need to handle complexities in GRA/LRA to assign
    adjacent registers.
  • Need to keep a mapping from the 8-byte WN to
    its two 4-byte TNs.
  • Difficult to handle ops producing an 8-byte result
    (two 4-byte TNs).
  • Difficult to handle ops using an 8-byte source (each
    input is two 4-byte TNs).
  • Accessing the lower / higher part of a register pair
    is trivial.

11
Motivating example
  • C Example

    #include <stdio.h>
    extern long long bar(int a, long long b);
    extern int baz(long long);
    int foo(long long a, int c)
    {
        long long tmp = c + bar(1, a);
        int retval = (tmp >> 32);
        if (c > 0)
            tmp++;
        return retval + baz(tmp);
    }

  • After Expansion

    11 TN2498 - asr_i_p TN2428 (0x20)
    11 TN2524 - tfr TN2498
    11 TN1(r0)4 - add TN2484 TN2524

  • After EBO

    11 TN2524 - pseudo_pair_high GTN2428
    11 GTN1(r0)4 - add GTN1(r0)4<defopnd> TN2524

12
TN2524 - pseudo_pair_high GTN2428
  • Pseudo pair instructions are expanded after
    register allocation:
  • NOP if the high half of GTN2428 is allocated the
    same register as TN2524.
  • Copy otherwise.
  • We want to avoid the copy:
  • Make sure both the high half of GTN2428 and TN2524
    share the same register.
  • Problem:
  • LRA and GRA phases work independently.
  • Live ranges of the source TN and the result TN can
    interfere in many ways:
  • GLOBAL, GLOBAL: both handled in GRA
  • GLOBAL, LOCAL: needs GRA/LRA interaction
  • LOCAL, GLOBAL: needs GRA/LRA interaction
  • LOCAL, LOCAL: both handled in LRA
  • LRA does not build the interference graph.

13
The solution strategy
  • Globalize any local TN if the other TN is global:
  • TN2524 - pseudo_pair_high GTN2428
  • Need to solve only 2 cases:
  • GLOBAL, GLOBAL: both handled in GRA
  • LOCAL, LOCAL: both handled in LRA
  • Minor changes in GRA:
  • GRA considers pseudo pair instructions as copies.
  • Attempts to remove the copy by preferencing.
  • Some changes in LRA:
  • Do a very simple liveness analysis to identify
    whether pseudo pair op TNs can be preferenced,
    i.e., share the same register.
  • Maintain a list of preference TNs in each TN's
    live range.
  • When choosing a color for a TN, check whether any TN
    in its preference list is already allocated, and
    take the same register if it is available.
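The color-selection step in the last bullet can be sketched as follows. This is a simplified model, not Open64's LRA code: the `TN` struct, `NUM_REGS`, `MAX_PREFS`, `NO_REG`, and `choose_register` are hypothetical names, and availability is reduced to a boolean per register.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS  8     /* hypothetical register file size */
#define MAX_PREFS 4     /* hypothetical cap on preference list length */
#define NO_REG   (-1)

/* Minimal stand-in for a temporary name with a preference list. */
typedef struct TN {
    int reg;                             /* assigned register or NO_REG */
    const struct TN *prefs[MAX_PREFS];   /* TNs we would like to share a register with */
    int num_prefs;
} TN;

/* Pick a register for tn: first try the register of any already-allocated
 * TN in its preference list, then fall back to the first free register. */
static int choose_register(const TN *tn, const bool available[NUM_REGS])
{
    assert(tn->num_prefs >= 0 && tn->num_prefs <= MAX_PREFS);
    for (int i = 0; i < tn->num_prefs; i++) {
        int r = tn->prefs[i]->reg;
        if (r != NO_REG && available[r])
            return r;                    /* share the register, copy becomes a NOP */
    }
    for (int r = 0; r < NUM_REGS; r++) {
        if (available[r])
            return r;
    }
    return NO_REG;                       /* nothing free; a real allocator would spill */
}
```

When the preferred register is taken by an interfering live range, the fallback path assigns a different register and the pseudo pair instruction later expands to a real copy.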

14
Results at a glance
  • Enhanced Open64 vs. baseline Open64: improvements
  • WOPT enhancement, e.g., on modem applications:
  • cycle count reduced 3% to 40%,
  • stack reduced as much as 50%,
  • code size reduced 1% to 2%.
  • Register-pair allocation:
  • Telecommunication: cycle count reduced 3.91% on
    average.
  • Kernel codes: cycle count reduced 1.77% on
    average.
  • Hyperblock enhancement (illustrated in terms of
    the ratio of HBs formed or BBs considered by the
    modified algorithm when compared to the
    default algorithm):
  • Networking: HBs formed 1.13,
    BBs considered 1.72
  • Telecommunication: HBs formed 1.00,
    BBs considered 1.49