Title: Code Selection for Media Processors with SIMD Instructions
1. Code Selection for Media Processors with SIMD Instructions
2. Code Selection
- Source -> Intermediate Representation (IR) -> Assembly
  - front end: Source -> IR
  - back end: IR -> Assembly
- Back end consists of
  - code selection
  - optimization, etc.
- The primary goal of traditional code selectors is
  - to find the minimum-cost machine code that is equivalent to a given IR
  - cost: latency, size, ...
  - the best-known code selector: twig by Aho et al. (TOPLAS89)
- This paper extends twig to handle SIMD instructions
  - A SIMD instruction manipulates data stored in subregisters instead of full registers (in this paper)
3. Code Generation Using Tree Matching and Dynamic Programming (Twig)
- Alfred V. Aho et al.
- TOPLAS89
4. Code Generation by Tree Matching: An Example
- IR of a[i] = b
  - Output of a front end
  - Input of a code generator
- Machine code of the IR
  - Output of the code generator
[Figure: the IR tree and the generated machine code, each annotated with matching labels (1) to (4)]
5. Code Generation by Tree Matching: An Example
6. Code Generation by Tree Matching: An Example
[Figure: pattern matches on the IR tree, annotated with the bindings "i = 0, c = a" and "i = 0, j = SP"]
7. Code Generation by Tree Matching: An Example
- What if several matches exist?
  - Ties are broken by cost
- How can the global cost be considered at local decision steps?
  - By dynamic programming!
8. Code Generation by Tree Matching: An Example
9. Code Generation by Tree Matching: An Example
- The minimum-cost cover of the IR of a[i] = b!
10. Dynamic Programming for Minimum-Cost Covering
- Informally, a necessary and sufficient condition for a problem to have an optimal dynamic programming algorithm is that
  - the problem exhibits optimal substructure!
- The problem of optimal code generation satisfies this condition
  - Partition the problem of generating optimal code for an expression E into subproblems of generating optimal code for the subexpressions
  - and later merge the sub-solutions (with some manipulation)
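
The partition-and-merge idea can be sketched as a bottom-up minimum-cost tree cover. This is a minimal illustration, not twig itself: the rule set, the instruction names (LOADC, LOADM, ADD, ADDI, ADDM), and the costs are hypothetical, and only a single nonterminal ("reg") is modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:          # IR tree node
    op: str
    kids: tuple = ()

@dataclass(frozen=True)
class Pat:           # rule pattern; the string "reg" is a wildcard leaf
    op: str
    kids: tuple = ()

# Hypothetical instruction set: (name, pattern, cost)
RULES = [
    ("LOADC", Pat("const"), 1),
    ("LOADM", Pat("mem"), 2),
    ("ADD",   Pat("+", ("reg", "reg")), 1),
    ("ADDI",  Pat("+", ("reg", Pat("const"))), 1),
    ("ADDM",  Pat("+", ("reg", Pat("mem"))), 2),
]

def match(pat, node, holes):
    """True if pat matches at node; wildcard subtrees are collected in holes."""
    if isinstance(pat, str):          # "reg" matches any already-reduced subtree
        holes.append(node)
        return True
    if pat.op != node.op or len(pat.kids) != len(node.kids):
        return False
    return all(match(p, k, holes) for p, k in zip(pat.kids, node.kids))

def best_cover(node, memo=None):
    """Return (cost, instruction list) of a minimum-cost cover rooted at node."""
    memo = {} if memo is None else memo
    if node in memo:
        return memo[node]
    best = None
    for name, pat, cost in RULES:
        holes = []
        if not match(pat, node, holes):
            continue
        total, code = cost, []
        for sub in holes:             # optimal substructure: reuse sub-solutions
            sub_cost, sub_code = best_cover(sub, memo)
            total += sub_cost
            code = code + sub_code
        if best is None or total < best[0]:
            best = (total, code + [name])
    if best is None:
        raise ValueError(f"no rule matches operator {node.op!r}")
    memo[node] = best
    return best

tree = Node("+", (Node("+", (Node("mem"), Node("const"))), Node("mem")))
cost, code = best_cover(tree)
```

For this tree the cheapest cover folds the constant and the second memory operand into the add instructions instead of loading them separately, which is the kind of global decision a purely local match could miss.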
11. When Common Subexpressions Exist
[Figure: example in which b and c both use the common subexpression a1]
- The problem of optimal code generation becomes intractable
- The algorithm does not guarantee an optimal solution
14. SIMD Instructions
- Media processors
  - TI C62xx, Philips Trimedia, Intel MMX
- Typically, the word length of media processors is 32 bits, but
  - audio data: 16 bits, video data: 8 bits -> waste of resources!
- So, they provide special instructions, called SIMD instructions, which
  - virtually split each full register into multiple subregisters and
  - perform identical computations on the subregisters in parallel
- But existing code selection techniques are not applicable
  - Ad hoc techniques are used instead: compiler intrinsics, hand-optimized libraries
15. Why can't traditional code selectors be used for SIMD?
- Recall Aho's algorithm
  - It processes one data flow tree (DFT) after another
  - But selecting SIMD instructions requires covering multiple DFTs simultaneously!
- Then, why don't we select SIMD instructions after the (traditional) code selection step?
  - Register allocation, that is,
  - if multiple values share a single 32-bit register, their live ranges may interfere!
  - So, we should be much more conservative (cf. Gebotys, DAC00)
16. An Example of Parallelization
- SIMD instructions
  - ADD2
    - Two 16-bit add operations
  - LOAD2
    - Two 16-bit load operations
  - STORE2
    - Two 16-bit store operations
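
To make the subregister idea concrete, here is a small emulation of an ADD2-style operation on a 32-bit word holding two 16-bit lanes. The helper names (pack2, unpack2, add2) and the wrap-around behavior on overflow are assumptions for illustration, not a specification of any particular processor.

```python
MASK16 = 0xFFFF

def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi in the upper half)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack2(word):
    """Split a 32-bit word into its (hi, lo) 16-bit halves."""
    return (word >> 16) & MASK16, word & MASK16

def add2(x, y):
    """Lane-wise 16-bit add: a carry out of the lower lane is discarded
    and never reaches the upper lane (assumed wrap-around semantics)."""
    lo = ((x & MASK16) + (y & MASK16)) & MASK16
    hi = (((x >> 16) & MASK16) + ((y >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo

x = pack2(1, 0xFFFF)
y = pack2(2, 1)
simd_sum = unpack2(add2(x, y))                 # lanes added independently
plain_sum = unpack2((x + y) & 0xFFFFFFFF)      # ordinary 32-bit add, for contrast
```

The contrast with the ordinary 32-bit add shows why a normal ADD cannot simply be reused on packed data: the carry from the lower half would corrupt the upper half.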
17. A Difficulty in Exploiting SIMD Instructions
- Parallel load/store
  - How to check that the memory addresses are contiguous?
  - In the terms of the PL community: must-alias analysis
- In this paper, traditional data flow analysis was used
  - For non-toy programs, stronger analysis techniques are needed (e.g., abstract interpretation, constraint-based analysis)
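
A conservative contiguity test might look like the sketch below. It assumes each access has already been resolved to a symbolic base plus a compile-time byte offset, and that the base is word-aligned; the Access type, the can_pair function, and the alignment requirement are hypothetical and not taken from the paper.

```python
from typing import NamedTuple

class Access(NamedTuple):
    base: str     # symbolic base address, e.g. an array name
    offset: int   # byte offset known at compile time
    width: int    # access width in bytes

def can_pair(a, b, word_bytes=4):
    """True only if a and b provably occupy the two halves of one aligned word.
    Accesses with different symbolic bases are treated as not provably contiguous."""
    if a.base != b.base or a.width != b.width:
        return False
    if 2 * a.width != word_bytes:
        return False
    lo, hi = sorted((a, b), key=lambda acc: acc.offset)
    return hi.offset - lo.offset == lo.width and lo.offset % word_bytes == 0

pair_ok = can_pair(Access("B", 0, 2), Access("B", 2, 2))
pair_misaligned = can_pair(Access("B", 2, 2), Access("B", 4, 2))
pair_unknown = can_pair(Access("B", 0, 2), Access("C", 2, 2))
```

Answering "no" whenever contiguity cannot be proven keeps the transformation safe; a stronger analysis would only enlarge the set of pairs for which the answer is "yes".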
18. Overview of the Code Generator
- 1. Identify contiguous load/store operations
- 2. Add pseudo reduction rules to the original DFTs
- 3. Find alternative covers for the modified DFTs by Aho's dynamic programming algorithm (Covering Phase)
  - may contain invalid covers (cannot be avoided when directly applying Aho's algorithm)
- 4. Select a valid cover that maximizes the use of SIMD instructions (Selection Phase)
  - transform into an ILP formulation
  - the validity constraint can be easily expressed in the ILP formulation
19. Covering and Selection Phases
- Recall Aho's algorithm
  - When finding a cover,
    - by dynamic programming
    - in a bottom-up manner
    - only one optimal solution is stored for each node
    - ties are broken arbitrarily
  - The selection phase is trivial
    - The uniquely stored cover directly corresponds to machine instructions
- For the problem with SIMD instructions, why not find a globally optimal solution in the covering phase?
  - Enforcing the validity condition is difficult in the covering phase if Aho's dynamic programming covering algorithm is to be used
- So, in the covering phase, find alternative solutions among which a globally optimal solution exists, and then
  - select the optimal solution by enforcing the validity condition
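20. Covering Phase: An Example
- a1 = b1 + c1
- a2 = b2 + c2
- b1: mem1 (16 bits)
- b2: mem2 (16 bits)
- c1 .. , c2
[Figure: the DFTs for a1 and a2 with alternative covers. Nodes are annotated with the nonterminals reg, reg_lo, and reg_hi; the operands b1, b2, c1, c2 are fetched from memory (mem1, mem2) through load nodes with reg and addr operands; the label "2" marks the paired rules.]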
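
One way to read "alternative covers" is that the covering phase keeps the best cover per nonterminal at every node, rather than a single winner per node. The sketch below illustrates that reading with the nonterminal names reg, reg_lo, and reg_hi from this example; the rule table, the instruction strings, and the unit costs are hypothetical placeholders.

```python
# Hypothetical rules: (result nonterminal, operator, operand nonterminals, cost, instruction)
RULES = [
    ("reg",    "mem16", (),                   1, "LOAD"),
    ("reg_lo", "mem16", (),                   1, "LOAD2, lower half"),
    ("reg_hi", "mem16", (),                   1, "LOAD2, upper half"),
    ("reg",    "+",     ("reg", "reg"),       1, "ADD"),
    ("reg_lo", "+",     ("reg_lo", "reg_lo"), 1, "ADD2, lower half"),
    ("reg_hi", "+",     ("reg_hi", "reg_hi"), 1, "ADD2, upper half"),
]

def label(node):
    """Bottom-up labelling: map each derivable nonterminal to (cost, instruction).
    Every alternative is kept, so the choice among them can be deferred."""
    op, kids = node
    kid_tables = [label(k) for k in kids]
    table = {}
    for result, rule_op, operands, cost, instr in RULES:
        if rule_op != op or len(operands) != len(kids):
            continue
        if any(nt not in t for nt, t in zip(operands, kid_tables)):
            continue  # an operand cannot be produced in the required form
        total = cost + sum(t[nt][0] for nt, t in zip(operands, kid_tables))
        if result not in table or total < table[result][0]:
            table[result] = (total, instr)
    return table

b1 = ("mem16", ())
c1 = ("mem16", ())
a1 = ("+", (b1, c1))
alternatives = label(a1)
```

Taken in isolation, the reg_lo and reg_hi entries are only half of a SIMD instruction, which is why a cover assembled from them may be invalid until a matching partner is chosen in another tree.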
21. Selection Phase
- What is a necessary and sufficient condition for validity?
  - Each node is covered by exactly one rule
  - The type of a parent must match the types of its children
  - Consistency of writes to and reads from 16-bit common subexpressions (CSEs)
  - Conditions for two nodes to be packed into a SIMD pair
  - Yet another artificial constraint to avoid scheduling deadlock
    - Cyclic dependence due to a pair of a true dependence and a false (output) dependence
- Subject to these constraints, select a cover that maximizes the use of SIMD instructions
- These can be expressed as an ILP formulation
22. Constraint 1
- Each node is covered by exactly one rule
  - binary decision for each rule r_j: the node is covered by r_j or not
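
As a hedged reconstruction of what such a constraint typically looks like (the slide's own formula is not preserved here), assume a binary variable x_{ij} that is 1 exactly when node n_i is covered by rule r_j, and let R(n_i) denote the rules that matched at n_i during the covering phase. Both symbols are assumptions introduced for this sketch.

```latex
\sum_{r_j \in R(n_i)} x_{ij} = 1 \quad \text{for every node } n_i,
\qquad x_{ij} \in \{0, 1\}
```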
23. Constraint 2
- The type of a parent must match the types of its children
24. Constraint 3
- Consistency of writes to and reads from 16-bit common subexpressions (CSEs)
25. Constraint 4
- Conditions for two nodes to be packed into a SIMD pair
26. Constraint 5
- Yet another artificial constraint to avoid scheduling deadlock
  - Cyclic dependence due to a pair of a true dependence and a false (output) dependence
27. Optimization Goal
- Subject to these constraints, select a cover that maximizes the use of SIMD instructions
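
The selection step can be illustrated on a toy instance by exhaustive search over the same 0/1 decisions an ILP would make. This is not an ILP solver, and it models only a simplified subset of the constraints (one rule per node, plus a basic pairing condition); the node names and the PAIRS set are hypothetical.

```python
from itertools import product

ALTS = ("reg", "reg_lo", "reg_hi")   # alternatives left over from the covering phase
NODES = ("add1", "add2", "add3")
PAIRS = {("add1", "add2")}           # hypothetical: only these two nodes may be packed

def partner(node):
    """Return the node that may share a SIMD instruction with node, if any."""
    for a, b in PAIRS:
        if node == a:
            return b
        if node == b:
            return a
    return None

def valid(choice):
    """A half-register choice is valid only if the partner takes the other half.
    'Exactly one rule per node' holds by construction of choice."""
    for node, alt in choice.items():
        if alt == "reg":
            continue
        other = partner(node)
        if other is None:
            return False
        wanted = "reg_hi" if alt == "reg_lo" else "reg_lo"
        if choice[other] != wanted:
            return False
    return True

def select():
    """Return the valid assignment with the most SIMD-covered nodes."""
    best, best_score = None, -1
    for alts in product(ALTS, repeat=len(NODES)):
        choice = dict(zip(NODES, alts))
        if not valid(choice):
            continue
        score = sum(alt != "reg" for alt in alts)
        if score > best_score:
            best, best_score = choice, score
    return best, best_score

best, score = select()
```

Exhaustive enumeration grows exponentially with the number of nodes, so it serves only to show the structure of the problem; for realistic inputs the same variables and constraints would be handed to an ILP solver.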
28. Experimental Results