Qcc Loop Optimization Presentation - PowerPoint PPT Presentation



Slides: 15

Provided by: alex135

Learn more at: https://www.capsl.udel.edu


Transcript and Presenter's Notes


1
Development of an efficient DSP Compiler based on Open64
Subrato K. De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan Brownhill, Sergei Larin, Taylor Simpson
Qualcomm Incorporated, San Diego and Austin, USA.
Email: {sde, adasgupt, sundeepk, tlinth, yzhu, slarin, tsimpson}@qualcomm.com
2
Introduction

DSP-specific extensions to the C/C++ language and support in Open64 for efficient usage of DSP hardware features.
Enhancement of WOPT with impact on DSPs for mobile devices.
Modification of the hyperblock scheduler for DSPs with limited predication.
Register pair allocation.
Results at a glance.

3
DSP-specific extensions to C/C++: use of circular load/store intrinsics
Hex Address    Word data
0x0F00
0x0F04
0x0F08
0x0F0C
0x0F10
0x0F14
Standard C:
    int *coeffPointer;
    for (i = 0; i < noOfInputSamples; i++) {
        sum = 0;
        coeffPointer = &coeff[0];
        for (j = 0; j < 4; j++)
            sum += inputSample[i + j] * (*coeffPointer++);
        outputSample[i] = sum;
    }
With the circular load intrinsic:
    int *coeffPointer;
    int value;
    for (i = 0; i < noOfInputSamples; i++) {
        sum = 0;
        for (j = 0; j < 4; j++) {
            LOAD_CIRC_INT(value, coeffPointer, 1, 4, &coeff[0]);
            sum += inputSample[i + j] * value;
        }
        outputSample[i] = sum;
    }
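The wrap-around behaviour behind LOAD_CIRC_INT can be modelled in portable C. The sketch below is only an illustration: circ_update_sw, LOAD_CIRC_INT_SW, and fir4 are hypothetical names, the argument order (pointer, step, length, base) is borrowed from the slide's macro, and the real intrinsic is expected to map to hardware addressing rather than to a modulo operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of a circular pointer update: advance p by
 * `step` elements inside the `len`-element buffer that starts at `base`,
 * wrapping around at either end. */
static int *circ_update_sw(int *p, ptrdiff_t step, ptrdiff_t len, int *base)
{
    assert(len > 0 && p >= base && p < base + len);
    ptrdiff_t idx = ((p - base) + step) % len;
    if (idx < 0)                /* C's % can yield a negative remainder */
        idx += len;
    return base + idx;
}

/* Software stand-in for the slide's LOAD_CIRC_INT: load, then update. */
#define LOAD_CIRC_INT_SW(v, p, s, l, a) \
    ((v) = *(p), (p) = circ_update_sw((p), (s), (l), (a)))

/* 4-tap FIR filter; `in` must hold n + 3 samples. */
static void fir4(const int *in, int *out, int n, int coeff[4])
{
    int *cp = coeff;            /* set once; wraps back to coeff[0] by itself */
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = 0; j < 4; j++) {
            int value;
            LOAD_CIRC_INT_SW(value, cp, 1, 4, coeff);
            sum += in[i + j] * value;
        }
        out[i] = sum;
    }
}
```

In this model the coefficient pointer is initialised once before the outer loop, since after four loads it is back at the start of the buffer.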
4
DSP-specific extensions to C/C++: use of bit-reversed load/store intrinsics
    unsigned int ReverseBits(unsigned int index, unsigned int NumBits)
    {
        unsigned int i, rev;
        for (i = rev = 0; i < NumBits; i++) {
            rev = (rev << 1) | (index & 1);
            index >>= 1;
        }
        return rev;
    }
Standard Bit    Reverse Bit
000             000
001             100
010             010
011             110
100             001
101             101
110             011
111             111
Standard C:
    for (i = 0; i < NumSamples; i++) {
        j = ReverseBits(i, NumBits);
        RealOut[j] = RealIn[i];
        ImagOut[j] = ImagIn[i];
    }
With the bit-reversed store intrinsic:
    int *realOutPointer;
    int *imagOutPointer;
    for (i = 0; i < NumSamples; i++) {
        STORE_BREV_INT(RealIn[i], realOutPointer, 1, NumSamples, &RealOut[0]);
        STORE_BREV_INT(ImagIn[i], imagOutPointer, 1, NumSamples, &ImagOut[0]);
    }
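A bit-reversed pointer update can be modelled in software as an increment whose carry propagates from the most significant bit toward the least significant one. The sketch below assumes a power-of-two buffer length and a step of 1; brev_next and store_brev are hypothetical helper names, and the hardware behaviour behind STORE_BREV_INT may differ in detail.

```c
#include <assert.h>

/* Reference bit reversal, equivalent to the slide's ReverseBits. */
static unsigned reverse_bits(unsigned index, unsigned num_bits)
{
    unsigned rev = 0;
    for (unsigned i = 0; i < num_bits; i++) {
        rev = (rev << 1) | (index & 1u);
        index >>= 1;
    }
    return rev;
}

/* Hypothetical model of a bit-reversed update: next element offset in
 * bit-reversed order for a buffer of `len` elements (len a power of two).
 * This is "add len/2 with the carry moving toward the LSB". */
static unsigned brev_next(unsigned off, unsigned len)
{
    unsigned bit = len >> 1;
    while (bit != 0 && (off & bit)) {
        off ^= bit;             /* clear the bit, carry to the next lower one */
        bit >>= 1;
    }
    return off | bit;           /* bit == 0 when wrapping from len-1 to 0 */
}

/* Scatter in[0..len-1] into out[] in bit-reversed order without calling
 * reverse_bits on every iteration. */
static void store_brev(const int *in, int *out, unsigned len)
{
    assert(len != 0 && (len & (len - 1)) == 0);   /* power of two only */
    unsigned off = 0;
    for (unsigned i = 0; i < len; i++) {
        out[off] = in[i];
        off = brev_next(off, len);
    }
}
```

For len = 8 the offsets visited are 0, 4, 2, 6, 1, 5, 3, 7, which matches the right-hand column of the 3-bit table.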
5
Design and implementation of circular and bit-reverse load/store intrinsics

Macro expansions:
    #define LOAD_CIRC_INT(v, p, s, l, a) \
        ( (v) = *(p), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )
    #define STORE_CIRC_INT(v, p, s, l, a) \
        ( *(p) = (v), (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) )
The WHIRL IR:
    <result> = load_indirect(pointer) OR store_indirect(pointer) = <source>
    <pointer> = circ_update(pointer, step, CR)
Where CR is the configuration register that defines the circular/bit-reverse buffer based on buffer size and the start address.
The load/store and circular/bit-reverse update are combined in the CG phase:
    <result, pointer> = load_circular_update(pointer, step, CR), OR
    <pointer> = store_circular_update(source, pointer, step, CR)
CRs are considered dedicated temporary names (TNs).
Algorithm for hoisting the loop-invariant CR assignment statements, and
algorithm for CR allocation when the target supports multiple CRs.

6
DSP-specific extensions to C/C++: use and optimizations through loop pragmas

Loop pragmas:
    #pragma LOOP_TRIP_COUNT_MIN(min)
    #pragma LOOP_TRIP_COUNT_MAX(max)
    #pragma LOOP_TRIP_COUNT_MODULO(modulo)
    #pragma LOOP_UNROLL(N)
The loop guard condition can be removed.
The loop trip count computation can be simplified (e.g., divide replaced by right shifts) or replaced with a constant count value.
The alternate loop during software pipelining may not be generated.
Remainder loops during unrolling (LNO and CG) may not be needed.
Unrolling: the standard benefits
    Less loop overhead by reducing branches, jumps, and comparisons.
    Scheduling can be better due to the increased number of operations in the loop body.
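To make the effect of the trip-count information concrete, the sketch below contrasts a loop unrolled by 4 with no knowledge of the trip count against the shape a compiler could produce if it were promised that the count is a positive multiple of 4. The pragmas appear only in a comment, because their exact placement and spelling on a real loop are not shown on the slide; sum_generic and sum_mod4 are hypothetical names, and the second function is written by hand rather than being actual compiler output.

```c
#include <assert.h>

/* Unrolled by 4 with no trip-count knowledge: a remainder loop is needed
 * for the n % 4 leftover iterations (and covers n == 0). */
static int sum_generic(const int *a, int n)
{
    int s = 0, i = 0;
    for (; i + 3 < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)              /* remainder loop */
        s += a[i];
    return s;
}

/* Same reduction under a promise such as
 *   LOOP_TRIP_COUNT_MIN(4) and LOOP_TRIP_COUNT_MODULO(4)
 * i.e. n is a positive multiple of 4. */
static int sum_mod4(const int *a, int n)
{
    assert(n >= 4 && n % 4 == 0);   /* the promise, checked by hand here */
    int s = 0;
    int trips = n >> 2;             /* n / 4 computed with a right shift */
    do {                            /* no guard: at least one trip */
        s += a[0] + a[1] + a[2] + a[3];
        a += 4;
    } while (--trips != 0);
    return s;
}
```

The second version has neither a guard test before the loop nor a remainder loop after it, which are two of the savings listed above.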

7
Register promotion of small structs (structures and unions <= 8 bytes that could fit in a register)

    typedef union {
        long long d;
        int w[2];
        short h[4];
        char b[8];
    } PAIR;
    PAIR var, Pi, Po;
    var.h[1] = 3;
    x = var.w[0];

THE WHIRL IR USING DISTINCT AUXILIARY SYMBOLS FOR EACH FIELD IN THE SMALL STRUCT, e.g., h[1]: st 9, w[0]: st 8
    I4INTCONST 3 (0x3)
    I2STID 2 <st 9>
    I4I4LDID 0 <st 8>
    I4STID 0 <st 5>

EXTRACT_BITS / COMPOSE_BITS TRANSFORMATION REPLACES THE AUXILIARY SYMBOLS FOR EACH FIELD OF THE SMALL STRUCT BY THE UNIQUE FULL-SIZED AUXILIARY SYMBOL REPRESENTING THE FULL STRUCTURE TYPE, e.g., PAIR: st 4
    I8I8LDID 0 <st 4>
    I4INTCONST 3 (0x3)
    I8CVTL 16
    I8COMPOSE_BITS <bofst:16 bsize:16>
    I8STID 0 <st 4>
    I8I8LDID 0 <st 4>
    I8EXTRACT_BITS <bofst:0 bsize:32>
    I4STID 0 <st 5>
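The two bit-field operations can be written out in plain C to show what the transformed IR computes on the single 64-bit value. This is a sketch under assumptions: compose_bits and extract_bits are hypothetical helpers that only mimic the bofst/bsize parameters seen in the IR, a zero-extending extract is assumed, and the comparison against the union in through_memory holds only on a little-endian target with 16-bit short, 32-bit int, and 64-bit long long.

```c
#include <assert.h>
#include <stdint.h>

/* Insert the low `bsize` bits of `val` into `word` at bit offset `bofst`. */
static uint64_t compose_bits(uint64_t word, uint64_t val,
                             unsigned bofst, unsigned bsize)
{
    assert(bsize >= 1 && bsize <= 63 && bofst + bsize <= 64);
    uint64_t mask = ((UINT64_C(1) << bsize) - 1) << bofst;
    return (word & ~mask) | ((val << bofst) & mask);
}

/* Extract `bsize` bits of `word` starting at `bofst`, zero-extended. */
static uint64_t extract_bits(uint64_t word, unsigned bofst, unsigned bsize)
{
    assert(bsize >= 1 && bsize <= 63 && bofst + bsize <= 64);
    return (word >> bofst) & ((UINT64_C(1) << bsize) - 1);
}

typedef union { long long d; int w[2]; short h[4]; char b[8]; } PAIR;

/* var.h[1] = 3; x = var.w[0]; with the whole union kept in one register. */
static int promoted(long long d)
{
    uint64_t reg = (uint64_t)d;
    reg = compose_bits(reg, 3, 16, 16);             /* var.h[1] = 3 */
    return (int)(int32_t)extract_bits(reg, 0, 32);  /* x = var.w[0] */
}

/* The same two statements through the union in memory, for comparison. */
static int through_memory(long long d)
{
    PAIR var;
    var.d = d;
    var.h[1] = 3;
    return var.w[0];
}
```

With every field access expressed on the same 64-bit value, there is a single candidate for register promotion instead of one symbol per field.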
8
Hyperblock Formation

Hyperblock scheduling can be important on DSPs
    Many DSP architectures support predicated instructions
    Especially profitable on control code
Restricted form of predication on DSP targets
    Not every instruction is predicatable
    Architecture decisions mainly due to resource and power savings
Default hyperblock algorithm rejected too many basic blocks due to instructions that aren't predicatable in the DSP
    Most potential hyperblocks on our target contained a few basic blocks
    Small hyperblocks exacerbated the problem
Needed a more permissive version of hyperblock formation

9
Changes to Hyperblock Formation Algorithm

Allowed a basic block (e.g., B3) to contain non-predicatable instructions if it post-dominates the entry block (e.g., B1) of the hyperblock.

B1: hyperblock entry block
B3: block containing non-predicatable instructions
B2, B4, B5: large, completely predicatable basic blocks

The default hyperblock algorithm would reject blocks B1-B5.
A hyperblock was successfully formed after our modifications.
The modified algorithm led to more basic blocks being considered and more hyperblocks being formed. Details in paper.
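The relaxed admission rule can be sketched as a small check over a control-flow graph. Everything here is illustrative: the Cfg type, the function names, and in particular the edges built by example_cfg are assumptions, since the slide's figure is not part of the transcript. Post-dominance is tested the simple way: a block post-dominates the entry if the exit cannot be reached from the entry once that block is removed.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 8

typedef struct {
    int  nblocks;
    bool edge[MAX_BLOCKS][MAX_BLOCKS];  /* edge[a][b]: a -> b */
    bool predicatable[MAX_BLOCKS];      /* all instructions predicatable */
} Cfg;

/* Can `to` be reached from `from` without passing through `avoid`? */
static bool reaches_avoiding(const Cfg *g, int from, int to, int avoid)
{
    bool seen[MAX_BLOCKS] = { false };
    int stack[MAX_BLOCKS], top = 0;
    if (from == avoid)
        return false;
    stack[top++] = from;
    seen[from] = true;
    while (top > 0) {
        int b = stack[--top];
        if (b == to)
            return true;
        for (int s = 0; s < g->nblocks; s++)
            if (g->edge[b][s] && s != avoid && !seen[s]) {
                seen[s] = true;
                stack[top++] = s;
            }
    }
    return false;
}

static bool post_dominates(const Cfg *g, int x, int entry, int exit_block)
{
    return x == exit_block || !reaches_avoiding(g, entry, exit_block, x);
}

/* Relaxed rule: fully predicatable blocks are always admissible; a block
 * with non-predicatable instructions only if it post-dominates the entry. */
static bool block_admissible(const Cfg *g, int b, int entry, int exit_block)
{
    assert(b >= 0 && b < g->nblocks);
    return g->predicatable[b] || post_dominates(g, b, entry, exit_block);
}

/* Hypothetical CFG, indices 0..4 standing for B1..B5:
 * B1->B2, B1->B4, B2->B3, B4->B3, B3->B5; only B3 is not predicatable. */
static Cfg example_cfg(void)
{
    Cfg g = { .nblocks = 5 };
    g.edge[0][1] = g.edge[0][3] = true;
    g.edge[1][2] = g.edge[3][2] = true;
    g.edge[2][4] = true;
    for (int b = 0; b < 5; b++)
        g.predicatable[b] = (b != 2);
    return g;
}
```

In this assumed graph B3 lies on every path from B1 to the exit, so it runs whenever the hyperblock is entered and presumably needs no predicate, whereas a non-predicatable B2 would still be rejected.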

10
Efficient Register Pair Allocation

Many DSPs support 64-bit registers by combining two adjacent 32-bit registers.
Two choices:
    Use a single 8-byte TN
        Handle all complexities in the register allocator.
        Direct mapping of data types from WHIRL to CGIR.
        Simple semantics: 1 result, 2 or more operands.
    Use two 4-byte TNs
        Need to handle complexities in GRA/LRA to assign adjacent registers.
        Need to keep a mapping between the 8-byte WN and its two 4-byte TNs.
        Difficult to handle ops producing an 8-byte result (two 4-byte TNs).
        Difficult to handle ops using an 8-byte source (each input is two 4-byte TNs).
        Accessing the lower / higher part of a register pair is trivial.

11
Motivating example
C example:
    #include <stdio.h>
    extern long long bar(int a, long long b);
    extern int baz(long long);
    int foo(long long a, int c)
    {
        long long tmp = c + bar(1, a);
        int retval = (tmp >> 32);
        if (c > 0)
            tmp++;
        return retval + baz(tmp);
    }

After expansion:
    11 TN2498 :- asr_i_p TN2428 (0x20)
    11 TN2524 :- tfr TN2498
    11 TN1(r0)4 :- add TN2484 TN2524

After EBO:
    11 TN2524 :- pseudo_pair_high GTN2428
    11 GTN1(r0)4 :- add GTN1(r0)4<defopnd> TN2524

12
TN2524 :- pseudo_pair_high GTN2428

Pseudo pair instructions are expanded after register allocation
    NOP if the high half of GTN2428 is allocated the same register as TN2524.
    Copy otherwise.
We want to avoid the copy
    Make sure the high half of GTN2428 and TN2524 share the same register.
Problem
    LRA and GRA phases work independently.
    Live ranges of the source TN and the result TN can interfere in many ways.
        GLOBAL, GLOBAL: both handled in GRA
        GLOBAL, LOCAL: needs GRA and LRA interaction
        LOCAL, GLOBAL: needs GRA and LRA interaction
        LOCAL, LOCAL: both handled in LRA
    LRA does not build the interference graph.

13
The solution strategy

Globalize any local TN if the other TN is global
    TN2524 :- pseudo_pair_high GTN2428
Need to solve only 2 cases
    GLOBAL, GLOBAL: both handled in GRA
    LOCAL, LOCAL: both handled in LRA
Minor changes in GRA
    GRA considers pseudo pair instructions as copies
    Attempts to remove the copy by preferencing
Some changes in LRA
    Do a very simple liveness analysis to identify whether pseudo pair op TNs can preference, i.e., share the same register
    Maintain a list of preference TNs in each TN's live range
    When choosing a color for a TN, check whether any TN in its preference list is already allocated, and take the same register if it is available
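The preference step in LRA can be sketched as a small colour-selection routine. This is a simplified model and not the Open64 code: the Tn struct, choose_register, and pseudo_pair_high_copies are hypothetical, the high half of the register pair is represented as an ordinary single-register TN, and avail[r] is assumed to already say whether register r is free over the live range of the TN being coloured.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS  8
#define NO_REG    (-1)
#define MAX_PREFS 4

typedef struct Tn {
    int reg;                        /* allocated register, or NO_REG */
    int nprefs;
    struct Tn *prefs[MAX_PREFS];    /* TNs this one would like to share with */
} Tn;

/* Pick a register for `tn`: first try the register of any already
 * allocated TN on its preference list, then fall back to the lowest
 * free register. */
static int choose_register(Tn *tn, const bool avail[NUM_REGS])
{
    assert(tn->nprefs >= 0 && tn->nprefs <= MAX_PREFS);
    for (int i = 0; i < tn->nprefs; i++) {
        int r = tn->prefs[i]->reg;
        if (r != NO_REG && avail[r])
            return tn->reg = r;     /* preference satisfied */
    }
    for (int r = 0; r < NUM_REGS; r++)
        if (avail[r])
            return tn->reg = r;
    return tn->reg = NO_REG;        /* nothing free: caller must spill */
}

/* Copies needed when the pseudo pair op is expanded after allocation:
 * none (a NOP) if both sides share a register, one copy otherwise. */
static int pseudo_pair_high_copies(int result_reg, int pair_high_reg)
{
    return result_reg == pair_high_reg ? 0 : 1;
}
```

When the preferred register is free, the pseudo pair op later expands to nothing; when it is not, the routine falls back to an ordinary choice and one copy remains.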

14
Results at a glance

Enhanced Open64 vs. baseline Open64: improvements
WOPT enhancement, e.g., on modem applications
    Cycle count reduced 3% to 40%,
    stack reduced by as much as 50%,
    code size reduced 1% to 2%.
Register-pair allocation
    Telecommunication: cycle count reduced 3.91% on average.
    Kernel codes: cycle count reduced 1.77% on average.
Hyperblock enhancement (illustrated in terms of the ratio of HBs formed or BBs considered by the modified algorithm when compared to the default algorithm)
    Networking: HBs formed 1.13, BBs considered 1.72
    Telecommunication: HBs formed 1.00, BBs considered 1.49