Title: Qcc Loop Optimization Presentation
1. Development of an efficient DSP Compiler based on Open64
Subrato K. De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan Brownhill, Sergei Larin, Taylor Simpson
Qualcomm Incorporated, San Diego and Austin, USA.
Email: {sde, adasgupt, sundeepk, tlinth, yzhu, slarin, tsimpson}@qualcomm.com
2. Introduction
- DSP specific extensions to the C/C++ language and support in Open64 for efficient usage of DSP hardware features.
- Enhancement of WOPT with impact on DSPs for mobile devices.
- Modification of the hyperblock scheduler for DSPs with limited predication.
- Register pair allocation.
- Results at a glance.
3. DSP specific extensions to C/C++: use of circular load/store intrinsics
[Figure: memory table with columns "Hex Address" and "Word data"; rows 0x0F00, 0x0F04, 0x0F08, 0x0F0C, 0x0F10, 0x0F14]
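Without the intrinsic, the coefficient pointer is reset on every outer iteration:

    int *coeffPointer;
    for (i = 0; i < noOfInputSamples; i++) {
        sum = 0;
        coeffPointer = &coeff[0];
        for (j = 0; j < 4; j++)
            sum += inputSample[i+j] * (*coeffPointer++);
        outputSample[i] = sum;
    }

With the circular load intrinsic, the pointer wraps around by itself:

    int *coeffPointer;
    int value;
    for (i = 0; i < noOfInputSamples; i++) {
        sum = 0;
        for (j = 0; j < 4; j++) {
            LOAD_CIRC_INT(value, coeffPointer, 1, 4, &coeff[0]);
            sum += inputSample[i+j] * value;
        }
        outputSample[i] = sum;
    }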
4. DSP specific extensions to C/C++: use of bit-reversed load/store intrinsics

    unsigned int ReverseBits(unsigned int index, unsigned int NumBits)
    {
        unsigned int i, rev;
        for (i = rev = 0; i < NumBits; i++) {
            rev = (rev << 1) | (index & 1);
            index >>= 1;
        }
        return rev;
    }
Standard bits    Reversed bits
000              000
001              100
010              010
011              110
100              001
101              101
110              011
111              111
Without the intrinsic, every iteration calls ReverseBits():

    for (i = 0; i < NumSamples; i++) {
        j = ReverseBits(i, NumBits);
        RealOut[j] = RealIn[i];
        ImagOut[j] = ImagIn[i];
    }

With the bit-reversed store intrinsic:

    int *realOutPointer;
    int *imagOutPointer;
    for (i = 0; i < NumSamples; i++) {
        STORE_BREV_INT(RealIn[i], realOutPointer, 1, NumSamples, &RealOut[0]);
        STORE_BREV_INT(ImagIn[i], imagOutPointer, 1, NumSamples, &ImagOut[0]);
    }
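The effect of a bit-reversed store can be modelled in portable C. The sketch below is only a software illustration, not the intrinsic's implementation: `brev_update_model`, `STORE_BREV_INT_MODEL` and `brev_scatter` are hypothetical names, the step argument of the real intrinsic is fixed at 1, and the buffer length is assumed to be a power of two. The pointer advances by adding 1 with the carry running from the most significant index bit downward, which visits the indices in the order shown in the table.

```c
#include <assert.h>
#include <stddef.h>

/* Reference bit reversal, equivalent to the ReverseBits() routine. */
static unsigned reverse_bits(unsigned index, unsigned num_bits)
{
    unsigned rev = 0;
    for (unsigned i = 0; i < num_bits; i++) {
        rev = (rev << 1) | (index & 1u);
        index >>= 1;
    }
    return rev;
}

/* Hypothetical software model of a bit-reversed pointer update:
 * add 1 to the buffer index with the carry propagating from the
 * most significant bit downward. len must be a power of two. */
static int *brev_update_model(int *p, size_t len, int *base)
{
    assert(len != 0 && (len & (len - 1)) == 0);
    size_t idx = (size_t)(p - base);
    size_t bit = len >> 1;
    while (bit != 0 && (idx & bit) != 0) {
        idx &= ~bit;          /* clear the bit and carry into the next lower one */
        bit >>= 1;
    }
    return base + (idx | bit);
}

/* Model of STORE_BREV_INT with step 1: store, then advance the pointer. */
#define STORE_BREV_INT_MODEL(v, p, len, base) \
    (*(p) = (v), (p) = brev_update_model((p), (len), (base)))

/* Scatter in[0..n-1] into out[] in bit-reversed order without
 * calling reverse_bits() inside the loop. */
static void brev_scatter(const int *in, int *out, size_t n)
{
    int *q = out;
    for (size_t i = 0; i < n; i++)
        STORE_BREV_INT_MODEL(in[i], q, n, out);
}
```

For an 8-entry buffer the model writes to indices 0, 4, 2, 6, 1, 5, 3, 7 in that order, so `out[reverse_bits(i, 3)]` ends up holding `in[i]`.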
5. Design and implementation of circular and bit-reverse load/store intrinsics
- Macro expansions

    #define LOAD_CIRC_INT(v, p, s, l, a) \
        ( (v) = *(p), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )
    #define STORE_CIRC_INT(v, p, s, l, a) \
        ( *(p) = (v), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )

- The WHIRL IR
  - <result> = load_indirect(pointer) OR store_indirect(pointer) = <source>
  - <pointer> = circ_update(pointer, step, CR)
  - where CR is the configuration register that defines the circular/bit-reverse buffer based on buffer size and the start address.
- The load/store and the circular/bit-reverse update are combined in the CG phase
  - <result, pointer> = load_circular_update(pointer, step, CR), OR
  - <pointer> = store_circular_update(source, pointer, step, CR)
- CRs are considered dedicated temporary names (TNs)
  - Algorithm for hoisting the loop-invariant CR assignment statements, and
  - Algorithm for CR allocation when the target supports multiple CRs.
6. DSP specific extensions to C/C++: use and optimizations through loop pragmas
- Loop pragmas
  - #pragma LOOP_TRIP_COUNT_MIN(min)
  - #pragma LOOP_TRIP_COUNT_MAX(max)
  - #pragma LOOP_TRIP_COUNT_MODULO(modulo)
  - #pragma LOOP_UNROLL(N)
- The loop guard condition can be removed.
- The loop trip count computation
  - can be simplified (e.g., divide replaced by right shifts).
  - can be replaced with a constant count value.
- The alternate loop during software pipelining may not be generated.
- Remainder loops during unrolling (LNO and CG) may not be needed.
- Unrolling: the standard benefits
  - Less loop overhead by reducing the branch/jump/comparison.
  - Scheduling can be better due to the increased number of operations in the loop body.
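The benefits listed above can be illustrated by writing out, by hand, the two loop shapes a compiler might produce. This is a sketch under assumptions, not compiler output: `dot_general` and `dot_with_hints` are hypothetical functions, and the pragmas appear only in a comment so the code builds with any C compiler.

```c
#include <assert.h>

/* Without trip-count information: after unrolling by 4 the loop needs
 * a guard (n may be 0) and a remainder loop for the last n % 4 items. */
static int dot_general(const int *a, const int *b, int n)
{
    int sum = 0;
    int i = 0;
    for (; i + 3 < n; i += 4)             /* unrolled main loop */
        sum += a[i] * b[i] + a[i + 1] * b[i + 1]
             + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    for (; i < n; i++)                    /* remainder loop */
        sum += a[i] * b[i];
    return sum;
}

/* Shape the loop may reduce to when the source carries
 *   #pragma LOOP_TRIP_COUNT_MIN(4)
 *   #pragma LOOP_TRIP_COUNT_MODULO(4)
 *   #pragma LOOP_UNROLL(4)
 * No guard, no remainder loop, and the trip count is a shift. */
static int dot_with_hints(const int *a, const int *b, int n)
{
    assert(n >= 4 && n % 4 == 0);         /* the programmer's promise */
    int sum = 0;
    int trips = n >> 2;                   /* divide replaced by a right shift */
    int i = 0;
    do {
        sum += a[i] * b[i] + a[i + 1] * b[i + 1]
             + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
        i += 4;
    } while (--trips != 0);
    return sum;
}
```

Both functions return the same result whenever the promise holds; the second simply has less control overhead and a single, larger loop body for the scheduler.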
7. Register promotion of small structs (structures and unions <= 8 bytes that could fit in a register)

    typedef union {
        long long d;
        int w[2];
        short h[4];
        char b[8];
    } PAIR;
    PAIR var, *Pi, *Po;

    var.h[1] = 3;
    x = var.w[0];

The WHIRL IR using distinct auxiliary symbols for each field in the small struct, e.g., h[1] -> st 9, w[0] -> st 8:

      I4INTCONST 3 (0x3)
    I2STID 2 <st 9>
      I4I4LDID 0 <st 8>
    I4STID 0 <st 5>

The EXTRACT_BITS / COMPOSE_BITS transformation replaces the auxiliary symbols for each field of the small struct member by the unique full-sized auxiliary representing the full structure type, e.g., PAIR -> st 4:

        I8I8LDID 0 <st 4>
          I4INTCONST 3 (0x3)
        I8CVTL 16
      I8COMPOSE_BITS <bofst:16 bsize:16>
    I8STID 0 <st 4>
        I8I8LDID 0 <st 4>
      I8EXTRACT_BITS <bofst:0 bsize:32>
    I4STID 0 <st 5>
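What the two bit operators compute can be sketched on a plain 64-bit integer. The helpers below are hypothetical C models, not Open64 code, and they assume that bit offset 0 is the least significant bit, which matches the example's offsets on a little-endian layout.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low bsize bits set (1 <= bsize <= 64). */
static uint64_t field_mask(unsigned bsize)
{
    assert(bsize >= 1 && bsize <= 64);
    return bsize == 64 ? ~UINT64_C(0) : ((UINT64_C(1) << bsize) - 1);
}

/* Model of EXTRACT_BITS: read bsize bits starting at bit bofst. */
static uint64_t extract_bits(uint64_t whole, unsigned bofst, unsigned bsize)
{
    assert(bofst + bsize <= 64);
    return (whole >> bofst) & field_mask(bsize);
}

/* Model of COMPOSE_BITS: return whole with the field replaced. */
static uint64_t compose_bits(uint64_t whole, uint64_t field,
                             unsigned bofst, unsigned bsize)
{
    assert(bofst + bsize <= 64);
    uint64_t mask = field_mask(bsize) << bofst;
    return (whole & ~mask) | ((field << bofst) & mask);
}

/* var.h[1] = 3; x = var.w[0]; performed on a value that lives
 * entirely in one 64-bit register (little-endian layout assumed). */
static int32_t promoted_example(uint64_t var)
{
    var = compose_bits(var, 3, 16, 16);        /* var.h[1] = 3 */
    return (int32_t)extract_bits(var, 0, 32);  /* x = var.w[0] */
}
```

Once every field access is expressed against the single 64-bit value, the whole union can stay in a register pair and no field needs its own memory location.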
8. Hyperblock Formation
- Hyperblock scheduling can be important on DSPs
  - Many DSP architectures support predicated instructions
  - Especially profitable on control code
- Restricted form of predication on DSP targets
  - Not every instruction is predicatable
  - Architecture decisions mainly due to resource and power savings
- Default hyperblock algorithm rejected too many basic blocks due to instructions that aren't predicatable in the DSP
  - Most potential hyperblocks on our target contained a few basic blocks
  - Small hyperblocks exacerbated the problem
- Needed a more permissive version of hyperblock formation
9. Changes to Hyperblock Formation Algorithm
- Allowed a basic block (e.g. B3) to contain non-predicatable instructions if it post-dominates the entry block (e.g. B1) of the hyperblock
[Figure: control-flow graph. B1: hyperblock entry block. B3: block contains non-predicatable instructions. B2, B4, B5: large, completely predicatable basic blocks.]
- Default hyperblock algorithm would reject blocks B1-B5
- Successfully formed hyperblock after our modifications
- Modified algorithm led to more basic blocks considered and more hyperblocks formed. Details in paper.
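The relaxed admission rule can be sketched as a check over post-dominator sets. Everything below is illustrative: the `Region` type, the function names and the CFG edges (B1->B2, B1->B4, B2->B3, B4->B5, B5->B3) are assumptions, since the slide's figure does not survive, and the real formation algorithm involves more criteria than this single test. One plausible reading of the rule is that a block post-dominating the entry runs on every path through the region, so it does not need a predicate.

```c
#include <assert.h>
#include <stdbool.h>

enum { NBLOCKS = 5 };                  /* B1..B5 stored at indices 0..4 */

typedef struct {
    unsigned succ[NBLOCKS];            /* bit s set in succ[b]: edge b -> s */
    bool all_predicatable[NBLOCKS];    /* every instruction can be predicated */
} Region;

/* Iterative post-dominator sets: bit d of pdom[b] is set when block d
 * post-dominates block b. Blocks without successors are exits. */
static void compute_postdoms(const Region *r, unsigned pdom[NBLOCKS])
{
    const unsigned all = (1u << NBLOCKS) - 1;
    for (int b = 0; b < NBLOCKS; b++)
        pdom[b] = r->succ[b] ? all : (1u << b);
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = 0; b < NBLOCKS; b++) {
            if (r->succ[b] == 0)
                continue;
            unsigned meet = all;       /* intersect over all successors */
            for (int s = 0; s < NBLOCKS; s++)
                if (r->succ[b] & (1u << s))
                    meet &= pdom[s];
            unsigned next = meet | (1u << b);
            if (next != pdom[b]) {
                pdom[b] = next;
                changed = true;
            }
        }
    }
}

/* Default rule (as modelled here): only fully predicatable blocks. */
static bool admit_default(const Region *r, int b)
{
    return r->all_predicatable[b];
}

/* Relaxed rule: also admit a block that post-dominates the entry. */
static bool admit_relaxed(const Region *r, const unsigned pdom[NBLOCKS],
                          int entry, int b)
{
    assert(entry >= 0 && entry < NBLOCKS && b >= 0 && b < NBLOCKS);
    return r->all_predicatable[b] || (pdom[entry] & (1u << b)) != 0;
}

/* Hypothetical region: B3 is the only block with non-predicatable code. */
static Region example_region(void)
{
    Region r = { {0}, {false} };
    r.succ[0] = (1u << 1) | (1u << 3); /* B1 -> B2, B4 */
    r.succ[1] = 1u << 2;               /* B2 -> B3 */
    r.succ[3] = 1u << 4;               /* B4 -> B5 */
    r.succ[4] = 1u << 2;               /* B5 -> B3 */
    for (int b = 0; b < NBLOCKS; b++)
        r.all_predicatable[b] = (b != 2);
    return r;
}
```

In this region B3 is rejected by the default rule but admitted by the relaxed one, because both paths from B1 end in B3.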
10. Efficient Register Pair Allocation
- Many DSPs support 64-bit registers by combining two adjacent 32-bit registers.
- Two choices
  - Use a single 8-byte TN
    - Handle all complexities in the register allocator.
    - Direct mapping of data types from WHIRL to CGIR.
    - Simple semantics: 1 result, 2 or more operands.
  - Use two 4-byte TNs
    - Need to handle complexities in GRA/LRA to assign adjacent registers.
    - Need to keep a mapping between the 8-byte WN and its two 4-byte TNs.
    - Difficult to handle ops producing an 8-byte result (two 4-byte TNs).
    - Difficult to handle ops using an 8-byte source (each input is two 4-byte TNs).
    - Accessing the lower / higher part of a register pair is trivial.
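The adjacency constraint that the allocator has to respect can be sketched with a free-register bitmask. This sketch assumes a 32-register file in which a pair is an even register plus the next odd one, with the low half in the even register; the slides only say "adjacent", so that alignment rule and the names `find_free_pair`, `pair_low` and `pair_high` are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* free_mask: bit r is set when 32-bit register r is free.
 * Returns the even register number of the first free aligned pair,
 * or -1 when no such pair exists. */
static int find_free_pair(uint32_t free_mask)
{
    for (int r = 0; r < 32; r += 2)
        if (((free_mask >> r) & 3u) == 3u)
            return r;
    return -1;
}

/* With one 8-byte TN per pair, the two halves follow directly from
 * the pair's base register. */
static int pair_low(int pair_base)
{
    assert(pair_base >= 0 && pair_base % 2 == 0);
    return pair_base;
}

static int pair_high(int pair_base)
{
    assert(pair_base >= 0 && pair_base % 2 == 0);
    return pair_base + 1;
}
```

Note that two free neighbours such as r1 and r2 do not form a pair under this assumption, which is one reason pair allocation fragments the register file faster than single-register allocation.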
11. Motivating example
- C example

    #include <stdio.h>
    extern long long bar(int a, long long b);
    extern int baz(long long);
    int foo(long long a, int c)
    {
        long long tmp = c + bar(1, a);
        int retval = (tmp >> 32);
        if (c > 0)
            tmp++;
        return retval + baz(tmp);
    }

- After expansion

    [11] TN249:8 :- asr_i_p TN242:8 (0x20)
    [11] TN252:4 :- tfr TN249:8
    [11] TN1(r0):4 :- add TN248:4 TN252:4

- After EBO

    [11] TN252:4 :- pseudo_pair_high GTN242:8
    [11] GTN1(r0):4 :- add GTN1(r0):4<defopnd> TN252:4
12. TN252:4 :- pseudo_pair_high GTN242:8
- Pseudo pair instructions are expanded after register allocation
  - NOP if the high half of GTN242:8 is allocated the same register as TN252:4.
  - Copy otherwise.
- We want to avoid the copy
  - Make sure both the high half of GTN242:8 and TN252:4 share the same register.
- Problem
  - LRA and GRA phases work independently.
  - Live ranges of the source TN and the result TN can interfere in many ways.
    - GLOBAL, GLOBAL: both handled in GRA
    - GLOBAL, LOCAL: needs GRA-LRA interaction
    - LOCAL, GLOBAL: needs GRA-LRA interaction
    - LOCAL, LOCAL: both handled in LRA
  - LRA does not build the interference graph.
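The expansion decision described above is small enough to write down directly. This is a hedged sketch: the `Expanded` type and `expand_pseudo_pair_high` are hypothetical, and it assumes the high half of a pair lives in the odd register that follows an even base register.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int dst_reg;      /* register assigned to the result TN */
    int src_reg;      /* register holding the high half of the pair */
    bool is_nop;      /* true when no instruction needs to be emitted */
} Expanded;

/* Expand "dst :- pseudo_pair_high pair" after register allocation.
 * pair_base is the even register assigned to the 8-byte TN. */
static Expanded expand_pseudo_pair_high(int dst_reg, int pair_base)
{
    assert(pair_base >= 0 && pair_base % 2 == 0);
    Expanded e = { dst_reg, pair_base + 1, dst_reg == pair_base + 1 };
    return e;         /* when !is_nop, emit a copy dst_reg <- src_reg */
}
```

The allocator's job is therefore to steer the result TN toward `pair_base + 1` so that `is_nop` comes out true.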
13. The solution strategy
- Globalize any local TN if the other TN is global
  - TN252:4 :- pseudo_pair_high GTN242:8
- Need to solve only 2 cases
  - GLOBAL, GLOBAL: both handled in GRA
  - LOCAL, LOCAL: both handled in LRA
- Minor changes in GRA
  - GRA considers pseudo pair instructions as copies
  - Attempts to remove the copy by preferencing
- Some changes in LRA
  - Do a very simple liveness analysis to identify whether pseudo pair op TNs can preference, i.e. share the same register
  - Maintain a list of preference TNs in each TN's live range
  - When choosing a color for a TN, check if any TN in its preference list is already allocated and get the same register if it is available
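The last LRA step, choosing a color with a preference list, might look like the following. It is a simplified sketch, not the Open64 LRA code: `LocalTN`, `choose_color` and the fixed-size preference array are hypothetical, and register availability is reduced to a 32-bit mask.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_PREFS = 4, UNASSIGNED = -1 };

typedef struct {
    int reg;                  /* assigned register, or UNASSIGNED */
    int prefs[MAX_PREFS];     /* indices of TNs this TN would like to share with */
    int num_prefs;
} LocalTN;

/* Pick a register for tns[t]. avail has bit r set when register r is
 * free over the whole live range of tns[t]. */
static int choose_color(LocalTN *tns, int t, uint32_t avail)
{
    assert(avail != 0);
    assert(tns[t].num_prefs >= 0 && tns[t].num_prefs <= MAX_PREFS);

    /* First try the register of any already-allocated preference TN. */
    for (int k = 0; k < tns[t].num_prefs; k++) {
        int r = tns[tns[t].prefs[k]].reg;
        if (r != UNASSIGNED && r < 32 && (avail & (UINT32_C(1) << r))) {
            tns[t].reg = r;
            return r;
        }
    }

    /* Otherwise fall back to the lowest-numbered free register. */
    for (int r = 0; r < 32; r++) {
        if (avail & (UINT32_C(1) << r)) {
            tns[t].reg = r;
            return r;
        }
    }
    return UNASSIGNED;        /* not reached: avail is non-zero */
}
```

When the preferred register is taken, the fallback still produces a valid allocation; the only cost is that the pseudo pair instruction later expands to a copy instead of a NOP.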
14. Results at a glance
- Enhanced Open64 vs. baseline Open64: improvements
- WOPT enhancement, e.g., on modem applications
  - Cycle count reduced 3% to 40%
  - Stack reduced by as much as 50%
  - Code size reduced 1% to 2%
- Register-pair allocation
  - Telecommunication: cycle count reduced 3.91% on average.
  - Kernel codes: cycle count reduced 1.77% on average.
- Hyperblock enhancement (illustrated in terms of the ratio of HBs formed or BBs considered by the modified algorithm when compared to the default algorithm)
  - Networking: HBs formed 1.13, BBs considered 1.72
  - Telecommunication: HBs formed 1.00, BBs considered 1.49