A Framework for Parallelizing Load/Stores on Embedded Processors

1
A Framework for Parallelizing Load/Stores on
Embedded Processors
  • Xiaotong Zhuang
  • Santosh Pande
  • John S. Greenland Jr.
  • College of Computing, Georgia Tech

2
Background and Motivation
  • Speed gap between memory and CPU remains
  • Multi-bank memory architectures: Motorola DSP56000
    series, NEC 77016, SONY pDSP, Analog Devices
    ADSP-210x, StarCore SC140 processor core
  • Parallel instructions allow parallel access to
    memory banks: PLDXY r1, @a, r2, @b loads @a→r1
    and @b→r2 at the same time
  • Objective:
  • Generate as many parallel Load/Store instructions
    (such as PLDXY) as possible through compiler
    optimizations
  • Keep code and data segment growth under control
  • Keep compilation speed reasonable
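As a rough illustration of the PLDXY pairing described above, the sketch below merges two loads only when they target different banks and different registers. The tuple-based instruction form, the `banks` mapping, the function name `try_merge_loads`, and the "X operand first" ordering are all assumptions made for this example, not details taken from the slides or from any real ISA manual.

```python
# Hypothetical mini-IR: a load is (dest_register, variable);
# banks maps each variable to its memory bank, "X" or "Y".
def try_merge_loads(ld1, ld2, banks):
    """Return a PLDXY string if the two loads can be paired, else None."""
    (r1, a), (r2, b) = ld1, ld2
    if r1 == r2:                  # register conflict: both loads write one register
        return None
    if banks[a] == banks[b]:      # same bank: no parallel access possible
        return None
    # Assumed operand order: X-bank operand first, Y-bank operand second.
    if banks[a] == "Y":
        (r1, a), (r2, b) = (r2, b), (r1, a)
    return f"PLDXY {r1}, @{a}, {r2}, @{b}"


banks = {"a": "X", "b": "Y", "c": "X"}
print(try_merge_loads(("r1", "a"), ("r2", "b"), banks))  # different banks
print(try_merge_loads(("r1", "a"), ("r3", "c"), banks))  # same bank
```

The second call returns None because both variables sit in bank X, which is exactly the situation the bank assignment step tries to avoid.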

3
General approaches
  • Model as an ILP problem: Rainer Leupers, Daniel
    Kotte, "Variable partitioning for dual memory
    bank DSPs," ICASSP, May 2001
  • Variables N_i with value 0/1 for each LD/ST
    instruction represent its memory bank
    assignment (X or Y)
  • Variables E_ij with value 0/1 represent whether
    two instructions can be merged
  • Enforce the other constraints and maximize the
    total weight of the selected edges
  • Model as a graph problem: A. Sudarsanam, S. Malik,
    "Simultaneous Reference Allocation in Code
    Generation for Dual Data Memory Bank ASIPs,"
    TODAES, April 2000
  • Each Load/Store is a node
  • An edge between two nodes means they can be merged
  • Pick the maximum number of disjoint edges
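The graph formulation above amounts to picking a largest set of edges that share no node. The sketch below does this by exhaustive search, which is only practical for tiny graphs; it is an illustration of the problem statement, not the algorithm used in the cited paper. The function name and the node labels are hypothetical.

```python
def max_disjoint_edges(edges):
    """Exhaustively find a largest set of pairwise node-disjoint edges."""
    best = []

    def search(start, used, chosen):
        nonlocal best
        if len(chosen) > len(best):
            best = list(chosen)
        for j in range(start, len(edges)):
            u, v = edges[j]
            if u not in used and v not in used:
                chosen.append((u, v))
                search(j + 1, used | {u, v}, chosen)
                chosen.pop()

    search(0, frozenset(), [])
    return best


# Each edge says "these two Load/Stores could be merged".
mergeable = [("ld1", "ld2"), ("ld2", "st3"), ("st3", "ld4")]
print(max_disjoint_edges(mergeable))
```

Picking the middle edge first would block both of its neighbours, which is why a plain greedy choice can miss the best answer.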

4
Major contributions
  • Keep the model simple and easy to solve
    mathematically
  • Identify the movable boundary problem, which
    impedes the problem modeling and simplification
  • Propose Motion Schedule Graph (MSG) and two
    approaches to solve it heuristically
  • Merge with instruction duplication and variable
    duplication
  • Cross basic block merges
  • Other improvements, such as local conflict
    elimination through rematerialization, and some
    global optimization issues
  • An iterative approach that systematically grows
    the code segment first and then the data segment,
    each as little as possible

5
Basic concepts (1)
  • Post-pass approach, assuming a good register
    allocator has been used (Appel and George's
    register allocation algorithm)
  • Alias analysis:
  • Memory access instruction disambiguation
  • Most aliases can be uniquely determined in our
    benchmark programs
  • Memory access instructions:
  • ST addr, r is the definition of a memory address
  • LD addr, r is the use of a memory address
  • For base-offset Load/Store instructions (normally
    array accesses), arrays are assumed inseparable,
    so more register conflicts must be considered
  • Dependencies: alias analysis
  • Address conflicts
  • Register conflicts

6
Basic concepts (2)
  • Building Webs
  • Webs: maximal unions of du-chains. All variable
    defs/uses on a web MUST be allocated to the same
    memory location
  • A variable that appears in separate webs can be
    placed in different memory locations
  • This achieves value separation
  • Motion range determination
  • Defined as the interval of program points over
    which a Load/Store can be legally moved,
    constrained by dependencies
  • Load/Store instructions with overlapping ranges
    MAY be merged
  • Watch out for the movable boundary problem
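One way to picture web construction is as a union-find over du-chains: any two chains that share a definition or a use end up in the same web. The sketch below assumes each du-chain is given as a (definition, uses) pair of hypothetical instruction labels, with ST as the definition and LD as the use; the input format and the name `build_webs` are illustrative only.

```python
def build_webs(du_chains):
    """Union du-chains that share a def or use; return the webs as sorted lists."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for definition, uses in du_chains:
        find(definition)                    # register defs that have no uses
        for use in uses:
            parent[find(use)] = find(definition)

    webs = {}
    for node in list(parent):
        webs.setdefault(find(node), set()).add(node)
    return sorted(sorted(web) for web in webs.values())


# ST1 and ST2 both reach LD3, so all three form one web;
# ST5/LD6 touch the same variable elsewhere and form a separate web.
chains = [("ST1", ["LD3"]), ("ST2", ["LD3"]), ("ST5", ["LD6"])]
print(build_webs(chains))
```

Because the second web is independent of the first, it could be placed in a different memory location, which is the value separation mentioned above.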

7
Movable boundary problem
  • The motion boundary of one Load/Store instruction
    can itself be a Load/Store instruction, which may
    also move
  • Assuming a fixed boundary will cause incorrect
    merges

8
Motion schedule graph
  • Pseudo fixed-boundary
  • For a Store: move as early as possible, assuming
    the other instructions are fixed
  • For a Load: move as late as possible, assuming the
    other instructions are fixed
  • Motion Schedule Graph
  • Nodes represent individual Load/Store
    instructions
  • An oval encloses the Load/Stores on the same web
  • Edges link nodes that have overlapping motion
    ranges (with respect to pseudo fixed-boundaries)
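A minimal sketch of how such a graph might be assembled, assuming the pseudo fixed-boundary motion ranges have already been computed as inclusive (first, last) program-point intervals. The input dictionaries, the node labels, and the name `build_msg` are hypothetical; the slides do not give a concrete data layout.

```python
from itertools import combinations


def build_msg(ranges, web_of):
    """Return (edges, ovals): overlapping-range pairs and nodes grouped by web."""
    edges = [
        (a, b)
        for a, b in combinations(sorted(ranges), 2)
        # Two inclusive intervals overlap when each starts before the other ends.
        if ranges[a][0] <= ranges[b][1] and ranges[b][0] <= ranges[a][1]
    ]
    ovals = {}
    for node in sorted(web_of):
        ovals.setdefault(web_of[node], []).append(node)
    return edges, ovals


ranges = {"ld1": (0, 3), "ld2": (2, 5), "st3": (6, 8)}
web_of = {"ld1": "w_a", "ld2": "w_b", "st3": "w_a"}
edges, ovals = build_msg(ranges, web_of)
print(edges)
print(ovals)
```

Only ld1 and ld2 have overlapping ranges here, so they are the only candidates for a merge; st3 shares an oval with ld1 because both belong to web w_a.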

9
Conflict resolution

10
Example

11
Graph solving
  • The whole problem is provably NP-complete (refer
    to Appendix A)
  • Two separate problems: Bank Assignment and Edge
    Picking
  • For predetermined bank assignments, the Edge
    Picking problem can be optimally solved in
    polynomial time
  • Heuristic algorithms:
  • Brute-force search takes O(V^3 · 2^n) time, which
    is feasible only for small programs
  • Simulated annealing (SA) can approach the optimal
    solution but greatly increases compilation time
  • Use a heuristic to solve Bank Assignment, then
    obtain the optimal solution for Edge Picking
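The split into Bank Assignment and Edge Picking can be sketched as a brute-force loop: enumerate every X/Y assignment of the webs, and for each one pick edges optimally. The sketch assumes that an edge is usable only when its endpoints land in different banks, which makes the remaining graph bipartite; it then uses augmenting-path bipartite matching as a stand-in for the max-flow formulation, whose details are not in this transcript. All names and the input format are hypothetical.

```python
from itertools import product


def max_matching(x_nodes, adj):
    """Augmenting-path bipartite matching; adj maps each X node to its Y neighbours."""
    match_of_y = {}

    def augment(x, seen):
        for y in adj.get(x, ()):
            if y in seen:
                continue
            seen.add(y)
            if y not in match_of_y or augment(match_of_y[y], seen):
                match_of_y[y] = x
                return True
        return False

    for x in x_nodes:
        augment(x, set())
    return sorted((x, y) for y, x in match_of_y.items())


def best_assignment(edges, web_of):
    """Try all 2^n bank assignments of the webs; keep the one merging the most pairs."""
    webs = sorted(set(web_of.values()))
    best = ([], {})
    for banks in product("XY", repeat=len(webs)):
        bank = dict(zip(webs, banks))
        adj = {}
        for a, b in edges:
            bank_a, bank_b = bank[web_of[a]], bank[web_of[b]]
            if bank_a == bank_b:          # assumed: same-bank pairs cannot merge
                continue
            x, y = (a, b) if bank_a == "X" else (b, a)
            adj.setdefault(x, []).append(y)
        picked = max_matching(sorted(adj), adj)
        if len(picked) > len(best[0]):
            best = (picked, bank)
    return best


edges = [("ld1", "ld2"), ("ld2", "ld3"), ("ld3", "ld4")]
web_of = {"ld1": "a", "ld2": "b", "ld3": "a", "ld4": "c"}
picked, bank = best_assignment(edges, web_of)
print(picked, bank)
```

The outer loop is where the exponential cost comes from; replacing it with a heuristic bank assignment while keeping the exact inner step mirrors the strategy in the last bullet above.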

12
Edge Picking as max flow problem

13
Bank assignment heuristic

14
Post-pass phases

15
Cross BB merge (Instr. duplication)
  • Move to a predecessor/successor to create new
    merge opportunities
  • To guarantee profitability:
  • Move only to where the reference is live
  • Move ST along the extended basic block (EBB)
  • Move LD along the reverse EBB
  • Make sure the instruction can be combined when
    pushed to at least one of the live
    predecessors/successors

16
Variable duplication

17
Local conflict elimination
  • Motivation
  • The register allocator may assign the same
    register to neighboring ranges, which leads to
    register conflicts
  • ISA restrictions may require particular registers
    that are not available at the program point
  • Use rematerialization to free a register for the
    merge, then reconstruct its value afterwards

18
Merge type and MSG properties

19
Compilation time

20
Runtime performance

21
Code size comparison

22
Conclusion
  • A framework to analyze and merge LD/STs.
  • Our heuristic approach comes close to the results
    of exhaustive search with less compilation time.
  • Variable and instruction replication enlarge the
    instructions' range of motion, so the quality of
    the generated code is superior to that of the
    exhaustive methods previously proposed.