Title: Evaluation of Offset Assignment Heuristics
1Evaluation of Offset Assignment Heuristics
- Johnny Huynh, Jose Nelson Amaral, Paul Berube
- University of Alberta, Canada
- Sid-Ahmed-Ali Touati
- Universite de Versailles, France
2Outline
- Background
- Traditional Approach to Offset Assignment
- Simple Offset Assignment
- Address-Register Assignment
- Improving the Problem Model
- Optimal Address-Code Generation
- Memory Layout Permutations
- Evaluating Current Heuristics
- Methodology
- Results
- Conclusions and Future Work
4Background
- Digital Signal Processors (DSPs) have few general-purpose registers
- Program variables kept in memory
- Address Registers (AR) used to access variables
- After a variable is accessed, the AR can be auto-incremented (or decremented) by one word in the same cycle.
5Processor Model
- Texas Instruments TMS320C54X DSP family
- Accumulator-based DSP
- 8 Address Registers
- Initializing an address register requires 2 cycles of overhead
- Explicit address computations require 1 cycle of overhead
- Using auto-increment (or auto-decrement) has no overhead.
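The cycle counts above can be written as a small cost function. This is a minimal sketch, assuming a single address register and word-granular addresses; the function name and the constants are illustrative and are not taken from the TMS320C54X documentation.

```python
INIT_COST = 2      # cycles to initialize an address register (from the slide)
EXPLICIT_COST = 1  # cycles for an explicit address computation (from the slide)


def single_register_overhead(addresses):
    """Overhead, in cycles, of visiting `addresses` in order with one register.

    A move of at most one word is assumed to be covered by
    auto-increment/auto-decrement and therefore costs nothing.
    """
    if not addresses:
        return 0
    cost = INIT_COST
    for prev, cur in zip(addresses, addresses[1:]):
        if abs(cur - prev) > 1:
            cost += EXPLICIT_COST
    return cost
```

For example, visiting 0x1000 and then 0x1001 costs only the initialization, while visiting 0x1000 and then 0x1002 adds one explicit address computation.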
6Processor Model: Example (add A and B, store in accumulator)
- AR0 = &A
- ACC = *AR0
- AR0 = AR0 + 2
- ACC += *AR0
[Figure: memory words at addresses 0x1000, 0x1001, and 0x1002, shown for the Auto-Increment case and for the Explicit address computation case]
8The Offset-Assignment Problem
- Given k address registers and a basic block accessing n variables, find a memory layout that minimizes address-computation overhead.
- How should the variables be placed in memory?
- Which register should access each variable?
10Traditional Approach to Offset Assignment
[Diagram: Basic Block -> Generate Access Sequence -> Access Sequence]
11Traditional Approach: Simple Offset Assignment (SOA)
- In 1992, Bartley introduced the simplest form of the offset assignment problem.
- Given a single address register and a basic block with n variables, find a memory layout that minimizes overhead.
- Equivalent to finding a maximum-weight path cover (NP-complete).
- Many researchers have proposed heuristics for this problem:
- Liao et al. (1996)
- Leupers and Marwedel (1996)
- Sugino et al. (1996)
12Simple Offset Assignment (SOA)
- Fix the access sequence
- Assume only one address register (k = 1)
- Find an ordering of variables in memory (memory
layout) that has minimum overhead.
Ex. Access Sequence: a d b e c f b e c f a d
[Figure: access graph over variables A-F, with edges labeled by weight 2, and the corresponding memory layout]
13Simple Offset Assignment (SOA)
- Create Access Graph G = (V, E)
- V = variables
- The weight of an edge is the frequency of consecutive accesses
- A path defines a memory layout -- find the Maximum Weight Path Cover
- NP-Complete!
Ex. Access Sequence: a d b e c f b e c f a d
[Figure: access graph over variables A-F, with edges labeled by weight 2, and the corresponding memory layout]
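The access graph and a path cover can be sketched in a few lines. The code below is an illustrative greedy heuristic in the spirit of the edge-selection approach usually attributed to Liao et al.: take edges from heaviest to lightest and keep an edge unless it would give a variable three neighbours or close a cycle. The function names and the alphabetical tie-breaking are assumptions made here; this is not the implementation evaluated in these slides, and it does not guarantee a maximum-weight cover.

```python
from collections import Counter


def access_graph(sequence):
    """Edge weights: how often two distinct variables are accessed consecutively."""
    weights = Counter()
    for u, v in zip(sequence, sequence[1:]):
        if u != v:
            weights[frozenset((u, v))] += 1
    return weights


def greedy_layout(sequence):
    """Return (memory layout, covered edge weight) from a greedy path cover."""
    weights = access_graph(sequence)
    variables = sorted(set(sequence))
    parent = {v: v for v in variables}     # union-find forest to detect cycles
    neighbors = {v: [] for v in variables}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    covered = 0
    # Heaviest edges first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for edge, w in ranked:
        u, v = sorted(edge)
        if len(neighbors[u]) < 2 and len(neighbors[v]) < 2 and find(u) != find(v):
            neighbors[u].append(v)
            neighbors[v].append(u)
            parent[find(u)] = find(v)
            covered += w

    # Walk each path from one of its endpoints to lay the variables out in memory.
    layout, placed = [], set()
    for start in variables:
        if start in placed or len(neighbors[start]) == 2:
            continue
        prev, cur = None, start
        while cur is not None:
            layout.append(cur)
            placed.add(cur)
            onward = [n for n in neighbors[cur] if n != prev]
            prev, cur = cur, (onward[0] if onward else None)
    return layout, covered
```

On the example sequence a d b e c f b e c f a d, this sketch yields the layout b, e, c, f, a, d and covers weight 9 of the 11 consecutive-access pairs, so a single register would need two explicit address computations.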
15Traditional Approach: General Offset Assignment (GOA)
- Problem presented by Liao et al. in 1996.
- Given k address registers and a basic block with n variables, find an assignment of variables to address registers that minimizes the total overhead of all registers.
- This problem formulation is more accurately described as Address-Register Assignment (ARA).
- Consists of SOA problems, and is at least NP-hard.
- Many researchers have proposed heuristics for address-register assignment:
- Leupers and Marwedel (1996)
- Sugino et al. (1996)
- Zhuang et al. (2003)
16General Offset Assignment (GOA)
- Fix the access sequence
- Allow multiple address registers (k > 1)
- Find an ordering of variables in memory (memory layout) that has minimum overhead.
- Assign each variable to an address register to form access sub-sequences.
Ex. Access Sequence: a d b e c f b e c f a d
Sub-sequence 1: a b c b c a
Sub-sequence 2: d e f e f d
[Figure: access graph over variables A-F]
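Forming the sub-sequences from a variable-to-register assignment is mechanical. A minimal sketch follows; the function name and the dictionary representation of the assignment are assumptions made for illustration.

```python
def split_by_register(sequence, assignment):
    """Split an access sequence into one sub-sequence per address register.

    `assignment` maps each variable to the register that accesses it;
    the relative order of accesses is preserved within each sub-sequence.
    """
    sub_sequences = {}
    for variable in sequence:
        sub_sequences.setdefault(assignment[variable], []).append(variable)
    return sub_sequences
```

With a, b, c assigned to one register and d, e, f to another, the example sequence splits into a b c b c a and d e f e f d, as on the slide.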
17General Offset Assignment (GOA)
- Each sub-sequence can be viewed as an independent SOA problem.
- Solve each sub-sequence as an independent SOA problem.
- More appropriate to call this problem the Address Register Assignment (ARA) problem.
- Requires solving SOA instances, so is at least NP-hard.
Ex. Access Sequence: a d b e c f b e c f a d
Sub-sequence 1: a b c b c a
Sub-sequence 2: d e f e f d
[Figure: access graph over variables A-F]
19Address-Code Generation
- Recall that variables are assigned to address registers.
- There is nothing left to decide: each address register has a defined sequence of accesses.
- Imposes a restriction that all accesses to a variable are done by a single address register.
Ex. Access Sequence: a d b e c f b e c f a d
[Figure: access graph over variables A-F and the memory layouts accessed by AR0 and AR1]
26Address-Code Generation
- Recall that variables are assigned to address registers.
- There is nothing left to decide: each address register has a defined sequence of accesses.
- Imposes a restriction that all accesses to a variable are done by a single address register.
Ex. Access Sequence: a d b e c f b e c f a d
[Figure: memory layouts accessed by AR0 and AR1, annotated "Requires Explicit Address Computations"]
27Traditional Approach to Offset Assignment
Access sequence: a d b e c f b e c f a d
Address Register Assignment produces one sub-sequence per address register (AR0 and AR1):
- a b c b c a
- d e f e f d
Simple Offset Assignment produces one memory layout per sub-sequence:
- a, b, c
- d, e, f
29Optimal Address-Code Generation
- Given a fixed access sequence and memory layout, it is possible to generate optimal address code in polynomial time:
- Minimum-Cost Circulation (Gebotys, 1997)
- Minimum-Weight Perfect Matching (Udayanarayanan, 2000)
30Optimal Address-Code Generation
- Build a network-flow graph.
- Vertices represent variable accesses.
- For each access ai that occurs before another access aj, there is an edge (ai, aj) (not all are shown in the graph).
- Edges represent an opportunity for a register to access variables.
- Each unit of flow represents the accesses performed by an address register.
- Optimal address code is found by finding a minimum-cost circulation.
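To make the objective concrete, the sketch below computes the minimum overhead for a fixed access sequence and memory layout when any register may serve any access. It is not the minimum-cost circulation formulation: it is an exhaustive dynamic program over register positions, practical only for small inputs, and it assumes the cost model of the Processor Model slide (2 cycles per register initialization, 1 cycle per explicit address computation, nothing for a move of at most one word). All names are illustrative, and k is assumed to be at least 1.

```python
from functools import lru_cache

INIT_COST = 2      # assumed: cycles to initialize an address register
EXPLICIT_COST = 1  # assumed: cycles for an explicit address computation


def optimal_overhead(sequence, layout, k):
    """Minimum overhead for `sequence` under `layout` using at most k registers."""
    address = {variable: i for i, variable in enumerate(layout)}
    targets = [address[variable] for variable in sequence]

    @lru_cache(maxsize=None)
    def best(i, regs):
        # regs: sorted tuple of the addresses held by the initialized registers
        if i == len(targets):
            return 0
        target = targets[i]
        options = []
        # Option 1: serve the access with a register that is already in use.
        for j, position in enumerate(regs):
            step = 0 if abs(position - target) <= 1 else EXPLICIT_COST
            moved = tuple(sorted(regs[:j] + (target,) + regs[j + 1:]))
            options.append(step + best(i + 1, moved))
        # Option 2: initialize a fresh register directly at the target address.
        if len(regs) < k:
            grown = tuple(sorted(regs + (target,)))
            options.append(INIT_COST + best(i + 1, grown))
        return min(options)

    return best(0, ())
```

Under these assumptions, the sequence a d b e c f b e c f a d with two registers has overhead 4 for the layout b, c, a, d, e, f and overhead 5 for the layout a, b, c, d, e, f. The first value agrees with the cost of 4 quoted for that layout in the Memory Layout Permutations example.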
31Traditional Approach to Offset Assignment
[Diagram: Access Sequence -> Address Register Assignment (NP-Hard) -> Sub-Sequences -> Simple Offset Assignment, one per sub-sequence (NP-Complete) -> Sub-Layouts -> Address-Code Generation (Solved, but not used!) -> Address-Computation Overhead]
32Memory Layout Permutations (MLP)
- Since optimal address-code generation algorithms exist, they can be applied after a memory layout is formed (by traditional approaches).
- However, the traditional approach generates multiple sub-layouts that were originally assumed to be independent.
- How is a single memory layout formed from a set of sub-layouts?
33Memory Layout Permutations
- Let M_i be a memory sub-layout.
- Let M_i^r be the reciprocal of M_i, that is, M_i in reverse order.
- Given an access sequence and m memory sub-layouts, arrange (M_1 | M_1^r), ..., (M_m | M_m^r) such that overhead is minimum when the sub-layouts are placed contiguously in memory.
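The candidate layouts can be enumerated directly: choose an order for the sub-layouts, then choose forward or reversed for each one. A minimal sketch follows; the function name is an assumption made here, and duplicates (which arise when a sub-layout reads the same in both directions) are dropped.

```python
from itertools import permutations, product


def memory_layout_permutations(sub_layouts):
    """All contiguous layouts built from the sub-layouts, each used forward or reversed."""
    layouts = []
    for order in permutations(sub_layouts):
        both_directions = [(tuple(m), tuple(reversed(m))) for m in order]
        for choice in product(*both_directions):
            layouts.append(tuple(v for part in choice for v in part))
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(layouts))
```

For m sub-layouts this produces at most m! * 2^m layouts; the sub-layouts (a, b, c) and (d, e, f) give 8.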
34Memory Layout Permutations: Example
Access sequence: a d b e c f b e c f a d
Address Register Assignment (this is an optimal address register assignment):
- a b c b c a
- d e f e f d
Simple Offset Assignment (these are optimal simple offset assignments):
- a, b, c
- d, e, f
All possible Memory Layout Permutations (all have cost > 4):
- a, b, c, d, e, f
- f, e, d, c, b, a
- c, b, a, d, e, f
- f, e, d, a, b, c
- a, b, c, f, e, d
- d, e, f, c, b, a
- c, b, a, f, e, d
- d, e, f, a, b, c
The optimal layout b, c, a, d, e, f, with cost 4, is not found.
36Experimental Methodology: Evaluating the Solution Space
- Test cases are DSP code kernels from the UTDSP benchmark suite.
- Use gcc to obtain access sequences.
- The quality of a memory layout is evaluated using the minimum-cost circulation technique.
- The entire solution space is found for each access sequence, to be used as a point of reference.
37Experimental Methodology: Evaluating Current Heuristics
- Identified and implemented three Address-Register Assignment heuristic algorithms:
- Leupers
- Sugino
- Zhuang
[Diagram: Access Sequence -> ARA (Leupers, Sugino, Zhuang) -> Sub-Sequences -> SOA (Liao, Leupers, ALOMA, OFU, BB) -> Sub-Layouts -> Memory Layout Permutations -> Memory Layouts -> Compute Overhead for each layout via Minimum-Cost Circulation -> Distribution of Overhead values]
38Experimental Methodology: Evaluating Current Heuristics
- Identified and implemented five Simple Offset Assignment heuristic algorithms:
- Liao
- Leupers
- ALOMA
- Order-First Use (OFU)
- Branch and Bound (BB)
[Diagram: evaluation pipeline, as on slide 37]
39Experimental Methodology: Evaluating Current Heuristics
- Each combination of ARA and SOA algorithm generates a set of sub-layouts.
- All possible memory layout permutations are generated, forming a set of memory layouts.
- Each memory layout is evaluated using the Minimum-Cost Circulation technique.
[Diagram: evaluation pipeline, as on slide 37]
40Results
- The 15 combinations of algorithms produce 15 distributions of overhead values.
- The distributions are aggregated into one distribution.
- The aggregate distribution represents the solution space of all current algorithms.
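The count of 15 comes from pairing the three ARA heuristics with the five SOA heuristics. The sketch below enumerates the pairs and merges per-combination distributions; representing a distribution as a Counter that maps an overhead value to a number of layouts is an assumption made here, not a description of the authors' tooling.

```python
from collections import Counter
from itertools import product

ARA_ALGORITHMS = ["Leupers", "Sugino", "Zhuang"]
SOA_ALGORITHMS = ["Liao", "Leupers", "ALOMA", "OFU", "BB"]


def algorithm_combinations():
    """Every (ARA, SOA) pairing evaluated in the study."""
    return list(product(ARA_ALGORITHMS, SOA_ALGORITHMS))


def aggregate(distributions):
    """Merge several {overhead: number of layouts} distributions into one."""
    total = Counter()
    for distribution in distributions:
        total.update(distribution)
    return total
```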
41Results
- Memory layouts have a significant impact on overhead.
- Some layouts have 100% higher overhead than the minimum.
- Over 99% of all layouts have an overhead that is 50% higher than the minimum.
42Results
- Memory layouts produced by traditional approaches have a large range of possible overhead values -- sometimes the same as the entire solution space itself.
- In some cases, no combination of ARA and SOA heuristics can produce an optimal layout.
44Distribution of Overhead Values
Testcase: iir_arr_swp -- infinite impulse response filter
45Exhaustive Solution Space
Testcase: iir_arr_swp -- infinite impulse response filter
46Algorithmic Solution Space
Testcase: iir_arr_swp -- infinite impulse response filter
47Efficiency of SOA Algorithms
- For each SOA algorithm, combine with each of the 3 ARA algorithms to generate 3 distributions of overhead values.
- The distributions can be aggregated to form a single distribution.
[Diagram: evaluation pipeline, as on slide 37]
52Efficiency of SOA Algorithms
Testcase: iir_arr_swp -- infinite impulse response filter
54Evaluating SOA Algorithms
Testcase: latnrm_ptr -- normalized lattice filter
55Efficiency of ARA Algorithms
- For each ARA algorithm, combine with each of the 5 SOA algorithms to generate 5 distributions of overhead values.
- The distributions can be aggregated to form a single distribution.
[Diagram: evaluation pipeline, as on slide 37]
58Efficiency of ARA Algorithms
Testcase: iir_arr_swp -- infinite impulse response filter
60Evaluating ARA Algorithms
Testcase: latnrm_ptr -- normalized lattice filter
61Evaluating Offset Assignment Algorithms
- There is low variability between SOA algorithms -- this may be attributed to small problem sizes.
- The choice of ARA algorithm has more impact on overhead. Much of the variability is attributed to the different number of address registers used.
- For all combinations of SOA and ARA algorithms, the permutation of sub-layouts affects the overhead.
63Conclusions
- The objective is to minimize address-computation overhead.
- Given a fixed access sequence and memory layout, the minimum-cost circulation (MCC) technique can minimize overhead.
- Offset assignment algorithms should be evaluated with MCC.
- Offset assignment still has a significant impact on overhead.
- To be effective, current offset assignment algorithms (ARA, SOA) must address the Memory Layout Permutation problem.
64Future Work
- A new algorithm is needed to generate memory layouts that will minimize overhead as computed by the Minimum-Cost Flow technique.
- Address-computation overhead must be minimized for loop bodies and for variables that are live between basic blocks and procedures.
65References
- Gebotys, C. DSP address optimization using a minimum cost circulation technique. Proceedings of the 1997 IEEE/ACM International Conference on Computer-Aided Design. 100-103.
- Leupers, R., Marwedel, P. Algorithms for address assignment in DSP code generation. Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design. 109-112.
- Liao, S., Devadas, S., Keutzer, K., Tjiang, S., Wang, A. Storage assignment to decrease code size. ACM Transactions on Programming Languages and Systems 18(3) (1996). 235-253.
- Sugino, N., Iimuro, S., Nishihara, A., Fujii, N. DSP code optimization utilizing memory addressing operation. IEICE Transactions on Fundamentals 8 (1996). 1217-1223.
- Zhuang, X., Lau, C., Pande, S. Storage assignment optimizations through variable coalescence for embedded processors. Proceedings of the 2003 ACM SIGPLAN Conference on Language, Compiler, and Tools for Embedded Systems. 220-231.
- Bartley, D.H. Optimizing stack frame accesses for processors with restricted addressing modes. Software: Practice and Experience 22(2) (1992). 158-172.
66Questions?