Code Selection for Media Processors with SIMD Instructions

1
Code Selection for Media Processors with SIMD Instructions

Rainer Leupers
DATE '00

2
Code Selection

Source → Intermediate Representation (IR) → Assembly
(the front end produces the IR; the back end turns it into assembly)
- The back end consists of code selection, optimization, etc.
- The primary goal of traditional code selectors is to find the minimum-cost machine code that is equivalent to a given IR
  - cost: latency, size, ...
  - the best-known code selector: twig, by Aho et al. (TOPLAS '89)
- This paper extends twig to handle SIMD instructions
  - a SIMD instruction manipulates data stored in subregisters instead of full registers (in this paper)
3
Code Generation Using Tree Matching and Dynamic Programming (Twig)

Alfred V. Aho et al.
TOPLAS '89

4
Code Generation by Tree Matching: An Example

- IR of a[i] := b: output of the front end, input of the code generator
- Machine code of the IR: output of the code generator
[Figure: the IR tree and the machine code, each annotated with labels (1) to (4)]
5
Code Generation by Tree Matching: An Example

Reduction rules

6
Code Generation by Tree Matching: An Example

[Figure: tree-matching step; surviving labels: "i0, ca" and "i0, jSP"]
7
Code Generation by Tree Matching: An Example

- What if several matches exist?
  - Ties are broken by cost
- How can the global cost be considered at local decision steps?
  - By dynamic programming!
8
Code Generation by Tree Matching: An Example
9
Code Generation by Tree Matching: An Example

The minimum-cost cover of the IR of a[i] := b!

10
Dynamic Programming for Minimum-Cost Covering

- Informally, a necessary and sufficient condition for a problem to have an optimal dynamic programming algorithm is that the problem exhibits optimal substructure
- Optimal code generation has this property
- Partition the problem of generating optimal code for an expression E into subproblems of generating optimal code for its subexpressions, then merge the sub-solutions (with some manipulation)
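
The partition-and-merge idea above can be sketched in a few lines of Python. This is a minimal illustration, not twig itself: the rule set, the costs, the single nonterminal "reg", and the names Node, RULES, match and best_cover are all assumptions made for the example.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Node:
    op: str
    kids: tuple = ()


# Hypothetical reduction rules: (instruction name, tree pattern, cost).
# The string "reg" inside a pattern matches any subtree that has already
# been reduced to a register; a tuple matches an operator node.
RULES = [
    ("LOADI", ("const",), 1),
    ("ADD", ("add", "reg", "reg"), 1),
    ("ADDI", ("add", "reg", ("const",)), 1),
    ("MUL", ("mul", "reg", "reg"), 3),
]


def match(pattern, node, holes):
    """Match pattern against node; subtrees bound to "reg" go into holes."""
    if pattern == "reg":
        holes.append(node)
        return True
    if pattern[0] != node.op or len(pattern) - 1 != len(node.kids):
        return False
    return all(match(p, k, holes) for p, k in zip(pattern[1:], node.kids))


@lru_cache(maxsize=None)
def best_cover(node):
    """Return (cost, instruction names) of a minimum-cost cover of node."""
    best = None
    for name, pattern, cost in RULES:
        holes = []
        if not match(pattern, node, holes):
            continue
        total, code = cost, ()
        for sub in holes:
            # Optimal substructure: reuse the optimal cover of each subtree.
            sub_cost, sub_code = best_cover(sub)
            total += sub_cost
            code += sub_code
        if best is None or total < best[0]:
            best = (total, code + (name,))
    if best is None:
        raise ValueError(f"no rule matches operator {node.op!r}")
    return best


# (const + const) * const
tree = Node("mul", (Node("add", (Node("const"), Node("const"))), Node("const")))
print(best_cover(tree))  # (6, ('LOADI', 'ADDI', 'LOADI', 'MUL'))
```

For the add node, the ADDI rule (cost 2 in total) beats ADD with two LOADI instructions (cost 3), and that local optimum is reused unchanged when the enclosing mul node is covered.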

11
When Common Subexpressions Exist

b a1 c a1

- The problem of optimal code generation becomes intractable
- The algorithm no longer guarantees an optimal solution

12
Code Selection for Media Processors with SIMD Instructions

Rainer Leupers
DATE '00

14
SIMD Instructions

- Media processors: TI C62xx, Philips Trimedia, Intel MMX
- Typically, the word length of media processors is 32 bits, but audio data is 16 bits and video data is 8 bits → waste of resources!
- So they provide special instructions, called SIMD instructions, which
  - virtually split each full register into multiple subregisters, and
  - perform identical computations on the subregisters in parallel
- But existing code selection techniques are not applicable
  - Ad hoc techniques: compiler intrinsics, hand-optimized libraries

15
Why can't traditional code selectors be used for SIMD?

- Recall Aho's algorithm
  - It processes one data flow tree (DFT) after another
  - But selecting SIMD instructions requires covering multiple DFTs simultaneously!
- Then why not select SIMD instructions after the (traditional) code selection step?
  - Register allocation: if multiple values share a single 32-bit register, their live ranges may interfere!
  - So we would have to be much more conservative (cf. Gebotys, DAC '00)

16
An Example of Parallelization

SIMD instructions
- ADD2: two 16-bit add operations
- LOAD2: two 16-bit load operations
- STORE2: two 16-bit store operations
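
To make the effect of an ADD2-style instruction concrete, the following sketch emulates it on a 32-bit word holding two 16-bit lanes. The helper names pack2, unpack2 and add2 are made up for this example, and wraparound (non-saturating) lane arithmetic is an assumption; a real processor's ADD2 may differ in detail.

```python
MASK16 = 0xFFFF


def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi in the upper half)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2(word):
    """Split a 32-bit word into its (hi, lo) 16-bit halves."""
    return (word >> 16) & MASK16, word & MASK16


def add2(x, y):
    """Two independent 16-bit additions on the halves of x and y.

    Assumption: each lane wraps around modulo 2**16, and a carry out of
    the low lane is never propagated into the high lane.
    """
    lo = ((x & MASK16) + (y & MASK16)) & MASK16
    hi = (((x >> 16) & MASK16) + ((y >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


x = pack2(1, 0xFFFF)
y = pack2(2, 1)
print(unpack2(add2(x, y)))            # (3, 0): the low lane wraps, the high lane is unaffected
print(unpack2((x + y) & 0xFFFFFFFF))  # (4, 0): a plain 32-bit add leaks the carry
```

The second print shows why an ordinary 32-bit ADD cannot simply stand in for ADD2: the carry from the low half corrupts the high half.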

17
A Difficulty in Exploiting SIMD Instructions

- Parallel load/store
  - How can we check that the memory addresses are contiguous?
  - In programming-language terms: must-alias analysis
- In this paper, traditional data flow analysis was used
- For non-toy programs, stronger analysis techniques are needed (e.g., abstract interpretation, constraint-based analysis)
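
As a rough illustration of what such a contiguity check has to establish, the sketch below assumes every 16-bit access has already been resolved to a symbolic base plus a constant byte offset. The Access type, the can_pair helper, and the alignment requirement are assumptions for this example and are not taken from the paper.

```python
from typing import NamedTuple


class Access(NamedTuple):
    base: str    # symbolic base address, e.g. an array name (assumed word-aligned)
    offset: int  # constant byte offset from the base


def can_pair(first, second, width=2, word=4):
    """True only if the two 16-bit accesses provably share one aligned 32-bit word.

    Conservative by design: accesses with different bases are rejected,
    because nothing is known about how those bases relate in memory.
    """
    return (
        first.base == second.base
        and second.offset == first.offset + width
        and first.offset % word == 0
    )


print(can_pair(Access("A", 0), Access("A", 2)))  # True: adjacent halves of one word
print(can_pair(Access("A", 2), Access("A", 4)))  # False: the pair straddles a word boundary
print(can_pair(Access("A", 0), Access("B", 2)))  # False: different bases, nothing provable
```

Answering "no" whenever adjacency cannot be proven is what makes this a must-style check: a missed pairing only costs performance, while a wrong pairing would produce incorrect code.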

18
Overview of the Code Generator

1. Identify contiguous load/store operations
2. Add pseudo reduction rules to the original DFTs
3. Find alternative covers for the modified DFTs with Aho's dynamic programming algorithm (Covering Phase)
      - the alternatives may contain invalid covers (unavoidable when Aho's algorithm is applied directly)
4. Select a valid cover that maximizes the use of SIMD instructions (Selection Phase)
      - transform the problem into an ILP formulation
      - the validity constraint can be expressed easily in the ILP formulation

19
Covering and Selection Phases

- Recall Aho's algorithm
  - When finding a cover by dynamic programming in a bottom-up manner, only one optimal solution is stored for each node
    - Ties are broken arbitrarily
  - The selection phase is trivial
    - The uniquely stored cover corresponds directly to machine instructions
- For the problem with SIMD instructions, why not find a globally optimal solution in the covering phase?
  - Enforcing the validity condition is difficult in the covering phase if Aho's dynamic programming covering algorithm is to be used
- So, in the covering phase, find alternative solutions among which a globally optimal solution exists, and then select the optimal solution by enforcing the validity condition

20
Covering Phase: An Example

a1 b1c1
a2 b2c2
b1 mem1 (16 bits)
b2 mem2 (16 bits)
c1 .. , c2

[Figure: DFTs for a1 and a2 annotated with alternative covers. Node labels: load, fetch, addr, mem1, mem2, b1, b2, c1, c2. Nonterminals: reg, reg_lo, reg_hi.]

21
Selection Phase

- What is a necessary and sufficient condition for validity?
  - Each node is covered by exactly one rule
  - The type of a parent is consistent with the types of its children
  - Writes to and reads from 16-bit CSEs are consistent
  - For two nodes to be packed into a SIMD pair,
  - Yet another artificial constraint to avoid scheduling deadlock
    - Cyclic dependence due to a pair of a true dependence and a false (output) dependence
- Subject to these constraints, select a cover that maximizes the use of SIMD instructions
- These can be expressed in an ILP formulation
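
The shape of the selection problem can be illustrated without an ILP solver. The sketch below substitutes exhaustive enumeration for the ILP formulation and models only two of the constraints in a simplified form: parent and child must have the same type, and every "lo" node must be matched with exactly one "hi" partner. The names select, ALTS, EDGES and PAIRS and the example data are hypothetical.

```python
from itertools import product


def select(alternatives, edges, pairs):
    """Pick one alternative per node, maximizing the number of SIMD pairs.

    alternatives: node -> list of cover types ("reg", "lo", "hi")
    edges:        (parent, child) pairs of the DFTs
    pairs:        candidate SIMD pairs (lo_node, hi_node)
    Exhaustive search stands in for the ILP solver in this sketch.
    """
    nodes = sorted(alternatives)
    best, best_score = None, -1
    # Taking one entry per node enforces "exactly one rule per node".
    for combo in product(*(alternatives[n] for n in nodes)):
        choice = dict(zip(nodes, combo))
        # Simplified type consistency: a parent and its children agree.
        if any(choice[p] != choice[c] for p, c in edges):
            continue
        packed = [(x, y) for x, y in pairs
                  if choice[x] == "lo" and choice[y] == "hi"]
        lo_nodes = [n for n in nodes if choice[n] == "lo"]
        hi_nodes = [n for n in nodes if choice[n] == "hi"]
        # Every lo node and every hi node belongs to exactly one packed pair.
        if (sorted(x for x, _ in packed) != lo_nodes
                or sorted(y for _, y in packed) != hi_nodes):
            continue
        if len(packed) > best_score:
            best, best_score = choice, len(packed)
    return best


ALTS = {
    "a1": ["reg", "lo"], "b1": ["reg", "lo"], "c1": ["reg", "lo"],
    "a2": ["reg", "hi"], "b2": ["reg", "hi"], "c2": ["reg", "hi"],
}
EDGES = [("a1", "b1"), ("a1", "c1"), ("a2", "b2"), ("a2", "c2")]
PAIRS = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]

print(select(ALTS, EDGES, PAIRS))
# If c2 cannot be packed, the constraints force the whole example back to scalar code.
print(select(dict(ALTS, c2=["reg"]), EDGES, PAIRS))
```

Enumeration is exponential in the number of nodes, so it is only useful for seeing how the constraints interact; the ILP formulation states the same kind of constraints declaratively and leaves the search to a solver.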

22
Constraint 1