A Framework for Parallelizing Load/Stores on Embedded Processors

1
A Framework for Parallelizing Load/Stores on
Embedded Processors

2
Background and Motivation

The speed gap between memory and CPU remains
Multi-bank memory architectures: Motorola DSP56000
series, NEC 77016, SONY pDSP, Analog Devices
ADSP-210x, StarCore SC140 processor core
Parallel instructions allow parallel access to
memory banks: PLDXY r1, @a, r2, @b loads @a→r1
and @b→r2 at the same time.
Objective
Generate as many parallel Load/Store instructions
(such as PLDXY) as possible through compiler
optimizations.
Controlled code and data segment growth
Reasonable compilation speed

3
General approaches

Model as an ILP problem: Rainer Leupers, Daniel
Kotte, "Variable partitioning for dual memory
bank DSPs", ICASSP, May 2001
Variables Ni with value 0/1 for each LD/ST instruction
to represent its memory bank assignment (X or Y)
Variables Eij with value 0/1 to represent whether
two instructions can be merged
Enforce the other constraints and maximize the
selected edge weight
Model as a graph problem: A. Sudarsanam, S. Malik,
"Simultaneous Reference Allocation in Code
Generation for Dual Data Memory Bank ASIPs",
TODAES, Apr. 2000
Each Load/Store is a node
An edge between two nodes means they can be merged
Pick the maximum number of disjoint edges
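
The graph model above can be illustrated with a small sketch. This is not the algorithm from the cited paper: it is a simple greedy pass that yields a set of disjoint edges (a maximal matching, not necessarily a maximum one). The instruction names and the edge list are hypothetical.

```python
def greedy_disjoint_edges(edges):
    """Pick edges in order, skipping any edge that shares a node
    with an edge that was already picked."""
    used = set()
    picked = []
    for u, v in edges:
        if u not in used and v not in used:
            picked.append((u, v))
            used.update((u, v))
    return picked


# Hypothetical Load/Store nodes; an edge means "these two can be merged".
edges = [("ld_a", "ld_b"), ("ld_b", "st_c"), ("st_c", "ld_d")]
pairs = greedy_disjoint_edges(edges)
```

Each picked pair stands for one parallel Load/Store instruction. Because the pass is greedy, a different edge order can produce fewer pairs than the true maximum.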

4
Major contributions

Keep the model simple and easy to solve
mathematically
Identify the movable boundary problem, which
impedes problem modeling and simplification
Propose the Motion Schedule Graph (MSG) and two
approaches to solve it heuristically
Merge with instruction duplication and variable
duplication
Cross-basic-block merges
Other improvements, such as local conflict
elimination through rematerialization and some
global optimization issues
An iterative approach that systematically grows
the code segment and then the data segment
minimally.

5
Basic concepts (1)

Post-pass approach, assuming a good register
allocator has been used: Appel and George's
register allocation algorithm
Alias analysis
Memory access instruction disambiguation
Most aliases can be uniquely determined in our
benchmark programs
Memory access instructions
ST addr, r is the definition of a memory address
LD addr, r is the use of a memory address
For base-offset Load/Store instructions, normally
used for arrays, assume arrays are inseparable;
more register conflicts will then be considered.
Dependencies: alias analysis
Address conflicts
Register conflicts

6
Basic concepts (2)

Building Webs
Webs: maximal unions of du-chains. All variable
defs/uses on a web MUST be allocated to the same
memory location
A variable that appears in separate webs can be
put into different memory locations
Achieves value separation
Motion range determination
Defined as the interval between program points
within which a Load/Store can be legally moved,
restrained by dependencies
Load/Store instructions with overlapping ranges
MAY be merged
Note the Movable Boundary problem
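
Web construction as described above can be sketched with a union-find structure. This is a minimal illustration, assuming each du-chain is given as a list of reference names (one definition followed by the uses it reaches). The function name `build_webs` and the reference names are hypothetical, not taken from the presentation.

```python
def build_webs(du_chains):
    """Merge du-chains that share a reference; each resulting set is one web."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for chain in du_chains:
        root = find(chain[0])
        for ref in chain[1:]:
            parent[find(ref)] = root

    webs = {}
    for ref in list(parent):
        webs.setdefault(find(ref), set()).add(ref)
    return sorted(sorted(web) for web in webs.values())


# Hypothetical du-chains for one variable: u2 is reached by both d1 and d2,
# so d1, d2, u1, u2 form one web; d3 and u3 form a separate web.
webs = build_webs([["d1", "u1", "u2"], ["d2", "u2"], ["d3", "u3"]])
```

In this example the variable ends up in two webs, so its two value ranges could be placed in different memory locations.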

7
Movable boundary problem

The motion boundary of one Load/Store instruction
is also a Load/Store instruction
Assuming a fixed boundary will cause incorrect
merges

8
Motion schedule graph

Pseudo fixed-boundary
For a Store: move as early as possible, assuming
other instructions are fixed
For a Load: move as late as possible, assuming other
instructions are fixed
Motion Schedule Graph
Nodes represent individual Load/Store
instructions
An oval encloses the Load/Stores on the same web
Edges link nodes that have overlapping motion
ranges (with respect to pseudo fixed-boundaries)
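
A rough sketch of how the MSG edges might be derived from pseudo fixed-boundaries follows. It assumes the earliest legal point of each Store and the latest legal point of each Load have already been computed from the dependencies. The field names (`pos`, `earliest`, `latest`, `web`) are hypothetical, and modeling a Store's range as [earliest, pos] and a Load's range as [pos, latest] is an assumption made for illustration.

```python
from itertools import combinations


def motion_range(instr):
    """Pseudo fixed-boundary range: a Store may only move earlier,
    a Load may only move later (all other instructions held fixed)."""
    if instr["kind"] == "ST":
        return (instr["earliest"], instr["pos"])
    return (instr["pos"], instr["latest"])


def build_msg(instrs):
    """Return (ovals, edges): nodes grouped by web, plus edges between
    nodes whose motion ranges overlap."""
    ovals = {}
    for ins in instrs:
        ovals.setdefault(ins["web"], []).append(ins["name"])
    edges = []
    for a, b in combinations(instrs, 2):
        ra, rb = motion_range(a), motion_range(b)
        if ra[0] <= rb[1] and rb[0] <= ra[1]:  # closed intervals overlap
            edges.append((a["name"], b["name"]))
    return ovals, edges


# Hypothetical basic block with three memory accesses.
instrs = [
    {"name": "ld1", "kind": "LD", "web": "a", "pos": 1, "latest": 4},
    {"name": "st1", "kind": "ST", "web": "b", "pos": 6, "earliest": 3},
    {"name": "ld2", "kind": "LD", "web": "c", "pos": 8, "latest": 9},
]
ovals, msg_edges = build_msg(instrs)
```

Here only `ld1` and `st1` have overlapping ranges, so the graph has a single edge.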

9
Conflict resolution
10
Example
11
Graph solving

The whole problem is provably NP-complete (refer
to Appendix A)
Two separate problems: Bank Assignment and Edge
Picking
For predetermined bank assignments, the Edge
Picking problem can be optimally solved in
polynomial time
Heuristic algorithms
Brute-force search takes O(V^3 2^n) time.
Doable for small programs
Simulated annealing (SA) can approach the optimal
solution but greatly increases compilation time
Use a heuristic to solve bank assignment, then get
the optimal solution for Edge Picking
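
The split into Bank Assignment and Edge Picking can be sketched as follows. The sketch assumes that only an X-bank access and a Y-bank access can be merged (as with PLDXY), so for a fixed assignment the usable edges form a bipartite graph. Edge Picking is then done with augmenting-path bipartite matching, which is equivalent to unit-capacity max flow. Edge weights are ignored, and all names (`pick_edges`, `brute_force`, `web_of`) are hypothetical.

```python
from itertools import product


def pick_edges(nodes, edges, bank):
    """Maximum set of disjoint X-Y edges for a fixed bank assignment."""
    adj = {n: [] for n in nodes if bank[n] == "X"}
    for u, v in edges:
        if bank[u] != bank[v]:  # same-bank pairs cannot be merged
            x, y = (u, v) if bank[u] == "X" else (v, u)
            adj[x].append(y)
    match = {}  # Y node -> X node currently paired with it

    def augment(x, seen):
        for y in adj[x]:
            if y in seen:
                continue
            seen.add(y)
            if y not in match or augment(match[y], seen):
                match[y] = x
                return True
        return False

    for x in adj:
        augment(x, set())
    return sorted((x, y) for y, x in match.items())


def brute_force(nodes, edges, web_of):
    """Try every bank assignment of the webs; keep the best edge picking."""
    webs = sorted(set(web_of.values()))
    best = []
    for banks in product("XY", repeat=len(webs)):
        web_bank = dict(zip(webs, banks))
        bank = {n: web_bank[web_of[n]] for n in nodes}
        picked = pick_edges(nodes, edges, bank)
        if len(picked) > len(best):
            best = picked
    return best


# Hypothetical MSG: ld_a1 and ld_a2 are on the same web, so they share a bank.
nodes = ["ld_a1", "ld_a2", "ld_b", "st_c"]
web_of = {"ld_a1": "a", "ld_a2": "a", "ld_b": "b", "st_c": "c"}
edges = [("ld_a1", "ld_b"), ("ld_a2", "st_c"), ("ld_b", "st_c")]
best = brute_force(nodes, edges, web_of)
```

The exhaustive loop is exponential in the number of webs, which is why a bank-assignment heuristic followed by optimal Edge Picking is attractive for larger programs.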

12
Edge Picking as max flow problem
13
Bank assignment heuristic
14
Post-pass phases
15
Cross BB merge (Instr. duplication)

Move to predecessor/successor to create new
opportunities
To guarantee profitability
Move to where the reference is live
Move ST on EBB
Move LD on reverse EBB
Make sure it can be combined if pushed to at least
one of the live predecessors/successors

16
Variable duplication
17
Local conflict elimination

Motivation
The register allocator may assign the same register
to neighboring ranges, which leads to register
conflicts
ISA restrictions may require particular registers
that are not available at the program point
Use rematerialization to free a register and
reconstruct it after the merge, making the
register available.

18
Merge type and MSG properties
19
Compilation time
20
Runtime performance
21
Code size comparison
22
Conclusion

A framework to analyze and merge LD/STs.
Our heuristic approach comes close to exhaustive
search with less compilation time.
We enhance the range of motion of the instructions
through variable and instruction replication, so
the generated code quality is superior to that of
the exhaustive methods previously proposed.
