Title: Mathematical Models and Tools for Compilation
1Lattice-Based Memory Allocation
Alain Darte, Compsys Project (Compilation and
Embedded Systems), CNRS, LIP, ENS-Lyon, France.
Joint work with Rob Schreiber (HP Labs) and
Gilles Villard (CNRS, LIP).
References: CASES'03, IEEE Transactions on
Computers (to appear).
WOG04, April 25th, 2004. Recent Trends in
Compiler Construction. Sven Verdoolaege's PhD
Defense.
2Outline
- Introduction
- The initial context: PICO, HP Labs software tool
for compiling high-level programs (e.g., C code)
into NPAs (Non Programmable Accelerators). How to
store intermediate results?
- Mathematical tools for high-level program
transformations.
- An example of communicating pipelined loops.
- Lattice-based memory allocation.
- Examples of previous work limitations.
- Main results and open questions.
3PiCo (Program In Chip Out)
HP Labs: automatic generation of non-programmable
accelerators (NPAs).
Similar tools: MMAlpha (Inria), Atomium (IMEC),
Compaan (Leiden).
Other possible inputs: recurrence equations,
Matlab, Kahn processes.
4High-Level Program Optimizations
- Program analysis: dependence analysis, lifetime
analysis, footprint analysis, array expansion,
array renaming, etc.
- Code and loop transformations: tiling,
scheduling, nested loop transformations, modulo
scheduling, etc.
- → Well-established mathematical tools and
theory: graph algorithms, polyhedral
manipulations, Hermite/Smith forms, integer
linear programming, Ehrhart polynomials, etc.
- BUT
- Memory optimizations:
- optimization of local memory (intra-loop buffer),
- optimization of inter-loop buffers for
communicating NPAs.
- → No suitable mathematical tools so far.
5Example: DCT-like code.
First NPA:
do br = 0, 63
  do bc = 0, 63
    do r = 0, 7
      A(br, bc, r, *)
    enddo
  enddo
enddo
pipelined with
Second NPA:
do br = 0, 63
  do bc = 0, 63
    do c = 0, 7
      A(br, bc, *, c)
    enddo
  enddo
enddo
Memory for A:
- How to schedule the computations?
- How to allocate elements of A in local memory so
as to reduce its size?
- a) Full array: 256K elements. b) Optimized size:
112 elements (< 2 blocks).
A(br, bc, r, c) mapped to (r mod 4, 16(br+bc) +
2r + c mod 28).
6Outline
- Introduction.
- Lattice-based memory allocation
- Definition of modular allocations.
- Conflicting indices and critical lattices.
- Examples of limitations of previous work.
- Main results and open questions.
7Memory Reduction Problem for Arrays
Given a scheduled program (i.e., operations are
not reordered), or several communicating
programs, find the minimal memory size to store
intermediate values and an adequate memory
mapping.
- Lifetime analysis
- Schedule of computations → lifetime for each
value (similar to dependence analysis, exact or
over-approximated).
- Memory reuse
- Values simultaneously live should not share the
same location (constraints similar to register
allocation).
- Restrict to simple addressing functions (for
code generation):
- canonical linearization, linear mapping in
multi-dimensional arrays,
- wrapping with modulo operations (reuse).
- → All are special cases of modular memory
allocations.
8Modular Mappings
- Generalization of (rotating) registers in higher
dimensions.
- Value indexed by i writes in multi-dimensional
position Mi mod b, where b is a positive integral
vector, and M an integral matrix.
- Ex.: i = (i1, i2) stored at @ = (2i1 + i2 mod 3,
i1 + i2 mod 6) → b = (3, 6), size 18.
Given a schedule and a lifetime analysis, find a
valid allocation (M,b) such that the product of
the components of b (memory size) is minimized.
- Generalizes all previous approaches:
- De Greef, Catthoor, De Man (1996-1997):
linearizations + 1 modulo.
- Lefebvre, Feautrier (1996-1997): successive
modulos.
- Wilde, Rajopadhye (1996), Quilleré, Rajopadhye
(2000): projections.
- Strout, Carter, Ferrante, Simon (ASPLOS'98): only
1 modulo.
- Thies, Vivien, Sheldon, Amarasinghe (PLDI'01):
same.
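A minimal sketch of what a modular mapping computes, using the slide's example M = ((2,1),(1,1)), b = (3,6). The function names `modular_address` and `memory_size` are illustrative and not taken from the talk.

```python
from math import prod

def modular_address(M, b, i):
    """Cell (M i) mod b for the index vector i, one component per row of M."""
    return tuple(sum(m * x for m, x in zip(row, i)) % bk
                 for row, bk in zip(M, b))

def memory_size(b):
    """Memory size of a modular mapping: the product of the moduli."""
    return prod(b)

# Slide example: (i1, i2) -> (2*i1 + i2 mod 3, i1 + i2 mod 6)
M = [[2, 1], [1, 1]]
b = (3, 6)
print(modular_address(M, b, (4, 5)))  # (1, 3)
print(memory_size(b))                 # 18
```

Python's `%` returns a non-negative result for a positive modulus, so negative indices also land inside the buffer.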
9Our Main Contributions
Thies et al., PLDI'01: "There is a need for a
technique able to consider more general storage
mappings and that would allow variations in the
number of array dimensions, while still capturing
the directional and modular reuse of the
occupancy vector."
- We identify the fundamental object to work with:
- The set S of all differences of conflicting
indices.
- We show the link with critical lattices:
- Finding the best allocation Mi mod b among ALL
possible modular allocations amounts to finding the
critical integer lattice for the set S.
- We give guaranteed heuristics to approximate the
optimal.
- → It explains previous work.
- → It gives new (and better) solutions.
- → It shows the link with theoretical work on
successive minima, basis reduction, Minkowski's
theorems, etc.
10Outline
- Introduction.
- Lattice-based memory allocation.
- Examples of previous work limitations
- rely on particular linearizations,
- or may wrap along the wrong axis.
- Main results and open questions.
11De Greef, Catthoor, and De Man
- Were the first to identify the need for memory
reduction techniques for embedded multimedia
applications. → Patent (1996) for intra- and
inter-array memory reuse.
- Inter-array reuse
- Geometrical heuristics for packing different
arrays in a given memory buffer. → Will not be
discussed here.
- Intra-array memory reuse
- Consider each original d-dimensional array and
its 2^d d! canonical linearizations. (Example in 2D:
for an NxM array, look at 8 linearizations: Mi+j,
Mi-j, -Mi+j, -Mi-j, i+Nj, i-Nj, -i+Nj, -i-Nj.)
- Compute the maximal address difference D between
two simultaneously live values.
- Select the linearization with smallest distance D
and wrap the array modulo (D+1).
12De Greef, Catthoor, De Man Example 1
do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

do i = 1,N
  do j = 1,N
    a(Ni+j mod (N+1)) = ...
    b(i,j) = a(Ni+j+1 mod (N+1))
  enddo
enddo

do i = 1,N
  do j = 1,N
    a(-i+j mod (N+1)) = ...
    b(i,j) = a(-i+j+1 mod (N+1))
  enddo
enddo
Column-major order (Fortran-like) i+Nj:
maximal distance N(N-1)+1. Row-major order
(C-like) Ni+j: maximal distance N. → Best
canonical linearization: Ni+j mod (N+1).
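The selection of the best canonical linearization can be sketched for this example as follows. The lifetime model is an assumption of the sketch, not a statement from the slides: a(i,j) is taken to be live from its write at iteration (i,j) to its read at iteration (i+1,j), both ends included, and values of the last row are live only at their write. The helper names are hypothetical.

```python
from itertools import product

def conflict_differences(N):
    """Differences (di, dj) between indices of simultaneously live values."""
    def date(i, j):                       # sequential date of iteration (i, j)
        return (i - 1) * N + (j - 1)
    cells = list(product(range(1, N + 1), repeat=2))
    life = {(i, j): (date(i, j), date(i + 1, j) if i < N else date(i, j))
            for i, j in cells}
    return {(p[0] - q[0], p[1] - q[1])
            for p, q in product(cells, repeat=2)
            if life[p][0] <= life[q][1] and life[q][0] <= life[p][1]}

def best_canonical_linearization(N):
    """Try the 8 canonical linearizations of an N x N array, keep the smallest D."""
    diffs = conflict_differences(N)
    candidates = []
    for s1, s2 in product((1, -1), repeat=2):
        candidates.append((s1 * N, s2))   # row-major family:    +-N*i +- j
        candidates.append((s1, s2 * N))   # column-major family: +-i +- N*j

    def max_distance(c):
        return max(abs(c[0] * di + c[1] * dj) for di, dj in diffs)

    best = min(candidates, key=max_distance)
    return best, max_distance(best)

coeffs, D = best_canonical_linearization(6)
print(coeffs, D, "-> wrap modulo", D + 1)
```

Under this lifetime model the row-major linearization wins with D = N, which agrees with the wrap modulo (N+1) shown on the slide.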
13De Greef, Catthoor, De Man Example 2
How could we have missed this?
do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
    b(i,j) = a(t-j-1,j)
  enddo
enddo

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j) = ...
    b(i,j) = a(t-j-1)
  enddo
enddo
Any canonical linearization leads to a distance
Θ(N^2)! But the allocation i mod N, or even i, is
just fine!
14Lefebvre and Feautrier
- Developed in the context of parallelizing
compilers.
- a) Eliminate spurious memory dependences thanks
to single assignment form. b) Wrap memory back
when possible.
- Inter-array reuse
- Coloring heuristics on array names (as for
register allocation).
- Intra-array memory reuse
- Idea 1: forget about original arrays, focus on
original loop indices.
- Idea 2: wrap successively in each dimension with
modulos.
- → From a computational point of view, use classical
techniques based on (rational) linear programming.
15Lefebvre, Feautrier Example 1 revisited
do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

do i = 1,N
  do j = 1,N
    a(i mod 2, j) = ...
    b(i,j) = a(i-1 mod 2, j)
  enddo
enddo
Along i, maximal distance 1 → i mod 2. Along j
(for a fixed i), maximal distance N-1 → j mod
N, i.e., j. → Selected allocation (i mod 2, j),
with a memory size 2N (note: N+1 in previous
solution).
16Lefebvre, Feautrier Example 2 revisited
do i = 1,N
  do j = 1,N
    a(i,j) = ...
    b(i,j) = a(i-1,j)
  enddo
enddo

do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j) = ...
    b(i,j) = a(t-j-1)
  enddo
enddo
Along i, maximal distance N-1 → i mod N, i.e.,
i. Along j (for a fixed i), maximal distance 0
→ no extra dimension. → Selected allocation i mod
N, i.e., i. (Note: order N^2 in previous
solution.)
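The successive-modulo mechanism of the two revisited examples can be sketched by brute force on a small N. This sketch enumerates conflicts explicitly instead of using the rational linear programming mentioned for the original method. It assumes a(i,j) is live from its write at iteration (i,j) to its read at iteration (i+1,j), both included; the function names and the `when` parameter are hypothetical.

```python
from itertools import product

def conflict_differences(N, when):
    """Index differences of values whose lifetimes overlap.

    `when(i, j)` is the execution date of iteration (i, j); tuples are
    compared lexicographically, which models nested sequential loops.
    """
    cells = list(product(range(1, N + 1), repeat=2))
    life = {(i, j): (when(i, j), when(i + 1, j) if i < N else when(i, j))
            for i, j in cells}
    return {(p[0] - q[0], p[1] - q[1])
            for p, q in product(cells, repeat=2)
            if life[p][0] <= life[q][1] and life[q][0] <= life[p][1]}

def successive_modulos(diffs):
    """Modulus k = 1 + largest k-th component among the differences
    whose components before k are all zero."""
    dim = len(next(iter(diffs)))
    moduli = []
    for k in range(dim):
        along_k = [d[k] for d in diffs if not any(d[:k])]
        moduli.append(max(along_k) + 1)
    return tuple(moduli)

N = 6
row_by_row = conflict_differences(N, lambda i, j: (i, j))      # Example 1
diagonal = conflict_differences(N, lambda i, j: (i + j, j))    # Example 2
print(successive_modulos(row_by_row))  # (2, 6): (i mod 2, j), size 2N
print(successive_modulos(diagonal))    # (6, 1): i mod N, size N
```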
17Lefebvre, Feautrier Example 3
do i = 1,N
  do j = 1,N
    a(i,j) = ...
  enddo
enddo
pipelined 1 clock cycle later with
do i = 1,N
  do j = 1,N
    b(i,j) = a(i,j) ...
  enddo
enddo
- Along i, maximal distance 1 → i mod 2.
- Along j (for a fixed i), maximal distance 1 → j
mod 2.
- Selected allocation (i mod 2, j mod 2) and size
4. OK.
18Lefebvre, Feautrier Example 3 (variant)
do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    a(t-j,j) = ...
  enddo
enddo
pipelined 1 clock cycle later with
do t = 2,2N /* t = i+j */
  do j = max(1,t-N),min(N,t-1)
    b(t-j,j) = a(t-j,j) ...
  enddo
enddo
- Along i, maximal distance N-1 → i mod N.
- Along j (for a fixed i), maximal distance 0 → j
mod 1.
- Corresponding memory size: N.
- Same if starting with j. FAIL!
19Outline
- Introduction.
- Lattice-based memory allocation.
- Examples of previous work limitations.
- Main results and open questions
- No way to explain quickly all details, even to
experts in lattice theory and reduction theory...
- See CASES'03 proceedings, research report
(http://perso.ens-lyon.fr/alain.darte), or IEEE
TC journal version (to appear).
- But I can try to:
- Explain basic concepts of critical lattice and
modular allocations.
- Illustrate different mechanisms.
- State results.
20There was a Need for a Framework for Memory
Reduction Based on Modular Allocations
- Lower bounds
- Given a lifetime analysis, can we give a lower
bound for the best achievable memory size? What
is the best modular memory allocation?
- Upper bounds
- Can we find mechanisms leading to allocations
whose corresponding memory size is not
arbitrarily bad compared to the lower bound
(guaranteed heuristics)?
- Robustness
- We need a framework that can possibly capture
parameters, that does not depend on the basis in
which the problem is described, etc. →
Geometrical model.
- Computability
- We need to make sure the mechanisms are
constructive and lead to heuristics (or
algorithms) that can be implemented.
21Set of Conflicting Index Differences
- Index description
- Choose an index description for values that are
going to share a given array (the allocation will
be linear with respect to these indices).
Typically, loop indices, array indices, etc.
- Set of conflicting index differences
- Build the set CS of pairs of conflicting (i.e.,
simultaneously live) indices (i,j), and the set
DS of differences (i-j).
- We want: ∀(i,j) ∈ CS, i ≠ j ⇒ Mi mod b ≠ Mj
mod b, or equivalently
∀d ∈ DS, d ≠ 0 ⇒ Md mod b ≠ 0,
or equivalently
(Md mod b = 0, d ∈ DS) ⇒ d = 0.
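The validity condition translates directly into a test on DS. The set `DS` below is a small hypothetical example made up for illustration; it does not come from any program in the talk.

```python
def is_valid(M, b, DS):
    """True iff no nonzero d in DS satisfies M d mod b = 0."""
    for d in DS:
        if any(d):  # skip d = 0
            image = [sum(m * x for m, x in zip(row, d)) % bk
                     for row, bk in zip(M, b)]
            if not any(image):
                return False  # two conflicting indices would share a cell
    return True

# Hypothetical 0-symmetric set of conflicting differences.
DS = {(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)}

print(is_valid([[1, 0], [0, 1]], (2, 2), DS))  # True  (size 4)
print(is_valid([[1, 1]], (2,), DS))            # False ((1,-1) maps to 0)
print(is_valid([[1, 2]], (3,), DS))            # True  (size 3)
```

On this toy set, the 1D mapping i + 2j mod 3 is valid and smaller than the 2D wrap (i mod 2, j mod 2).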
22Admissible and Critical Lattices
- The kernel of (M,b)
- The set Λ = {i | Mi mod b = 0} is a
full-dimensional lattice.
- (M,b) is valid iff Λ ∩ DS = {0}, i.e., Λ is an
admissible lattice for DS.
- Conversely
- If A is a basis for Λ, admissible integral
lattice for DS, compute the Smith form A = Q1 S
Q2 with Q1 and Q2 unimodular, S = diag(b).
- The mapping (M,b) where M is the inverse of Q1
has the kernel Λ, thus is a valid allocation with
memory size det(S) = det(Λ).
- → The modular allocation with smallest memory
size corresponds to a critical integer lattice
for DS, i.e., an admissible integer lattice for
DS with smallest determinant.
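The link between a mapping and its kernel can be checked on a small case. The sketch below goes in the direction mapping → lattice and only handles a 2x2 unimodular M (a restriction of the sketch, which avoids computing a Smith form): the kernel is then generated by the columns of M^{-1} diag(b). The function name is hypothetical.

```python
def kernel_basis_2d(M, b):
    """Basis of {i | M i mod b = 0} for a 2x2 integer M with det(M) = +-1."""
    (p, q), (r, s) = M
    det = p * s - q * r
    assert det in (1, -1), "sketch only handles unimodular M"
    # Inverse of M; for det = +-1, dividing by det equals multiplying by it.
    inv = [[s * det, -q * det], [-r * det, p * det]]
    return [(inv[0][0] * b[0], inv[1][0] * b[0]),   # first column times b1
            (inv[0][1] * b[1], inv[1][1] * b[1])]   # second column times b2

M, b = [[2, 1], [1, 1]], (3, 6)
v1, v2 = kernel_basis_2d(M, b)
print(v1, v2)                               # (3, -3) (-6, 12)
print(abs(v1[0] * v2[1] - v1[1] * v2[0]))   # 18 = det of the kernel = memory size
```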
23Modular Mappings: Toy Example
Corners: (-1,5), (1,-5), (8,1), (-8,-1)
24Modular Mappings: Toy Example
Bounding box: (i mod 9, j mod 6) → Size 54
25Modular Mappings: Toy Example
Successive modulos: (i mod 9, j mod 5) → Size 45
26Modular Mappings: Toy Example
Skewed bounding box: (i-j mod 8, j mod 6) → Size 48
27Modular Mappings: Toy Example
Skewed successive modulos: (i-j mod 8, j mod 4)
→ Size 32
28Modular Mappings: Toy Example
Better allocation: (i-j mod 7, j mod 4) → Size 28
29Modular Mappings: Toy Example
Critical lattice basis: (4,3), (8,0) → Best
allocation (3i-4j mod 24).
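The allocations of the toy example can be checked by enumeration. The sketch assumes that the conflicting differences are the integer points of the parallelogram with the four listed corners. The half-plane description |4x+9y| ≤ 41, |6x-7y| ≤ 41 was derived by hand from those corners and is not given in the talk.

```python
from math import prod

def points_of_K():
    """Integer points of the parallelogram with corners (-1,5), (8,1), (1,-5), (-8,-1)."""
    return [(x, y) for x in range(-8, 9) for y in range(-5, 6)
            if abs(4 * x + 9 * y) <= 41 and abs(6 * x - 7 * y) <= 41]

def is_valid(M, b, DS):
    """True iff no nonzero d in DS satisfies M d mod b = 0."""
    return not any(
        any(d) and not any(sum(m * x for m, x in zip(row, d)) % bk
                           for row, bk in zip(M, b))
        for d in DS)

K = points_of_K()
allocations = {
    "bounding box":              ([[1, 0], [0, 1]], (9, 6)),
    "successive modulos":        ([[1, 0], [0, 1]], (9, 5)),
    "skewed bounding box":       ([[1, -1], [0, 1]], (8, 6)),
    "skewed successive modulos": ([[1, -1], [0, 1]], (8, 4)),
    "better allocation":         ([[1, -1], [0, 1]], (7, 4)),
    "critical lattice":          ([[3, -4]], (24,)),
}
for name, (M, b) in allocations.items():
    print(f"{name}: size {prod(b)}, valid = {is_valid(M, b, K)}")

# Shrinking the last 2D allocation further breaks validity: (3,3) maps to (0,0).
print(is_valid([[1, -1], [0, 1]], (7, 3), K))  # False
```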
30Results for 0-Symmetric Convex Bodies
- We work with a 0-symmetric polytope K such that
DS ⊆ K. (Actually, we assume that the vector
spaces generated by the points in K and the
integer points in K are equal ? K is
full-dimensional.)
- Lower bound in terms of volume: Vol(K)/2^n.
- Optimal solution found by optimized enumeration +
ILP.
- Heuristics exist with memory size ≤ c_n Vol(K),
where c_n depends on the dimension n only. →
Guaranteed heuristics.
- One heuristic uses exactly Lefebvre-Feautrier
mechanism but in a well-chosen basis. Always
equivalent (i.e., with same memory size) to a
particular linearization (1D mapping).
- Another heuristic (Rogers' principle) works even
for arbitrary sets, but equivalent linearization
not clear.
- In practice: follow the schedule, when possible...
Reference: Gruber and Lekkerkerker, Geometry of
Numbers.
31Remarks on critical lattices
- a) Hard to find the critical lattice, starting
from 3D, even for simple bodies. b) Critical
integer lattice ≈ critical lattice for large
bodies. → Hard to find the optimal, heuristics
needed.
- Lower bound in terms of volume: Δ(K) ≥ Vol(K)/2^n.
- If S-S ⊆ K, then all elements in S are mapped to
different locations → Δ(K) ≥ Card(S).
- Minkowski's first theorem: if Λ is a lattice and
K is a 0-symmetric convex body with Vol(K) > 2^n
det(Λ), then K contains a nonzero lattice point of Λ.
- Gauge function: F(x) = inf{λ > 0 | x ∈ λK} is a
distance function.
- Successive minima: λi(K) = inf{λ > 0 | dim(Vect(λK
∩ Z^n)) ≥ i}.
- Minkowski's second theorem:
(2^n/n!) det(Λ) ≤ λ1(K) ... λn(K) Vol(K) ≤ 2^n det(Λ)
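As a worked instance of the volume bound, the sketch below evaluates Vol(K)/2^n for the parallelogram of the toy example (n = 2). The corners are reordered along the boundary, which the shoelace formula requires; the function name is illustrative.

```python
from fractions import Fraction
from math import ceil

def polygon_area(vertices):
    """Shoelace formula; vertices must be listed in boundary order."""
    twice = sum(x1 * y2 - x2 * y1
                for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]))
    return Fraction(abs(twice), 2)

corners = [(-1, 5), (8, 1), (1, -5), (-8, -1)]   # boundary order
area = polygon_area(corners)
bound = area / 2**2                               # Vol(K) / 2^n with n = 2
print(area, bound, ceil(bound))                   # 82 41/2 21
```

So, under this bound, no modular allocation for the toy example can use fewer than 21 cells, against the 24 cells of the allocation 3i-4j mod 24.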
32Looking for the optimal solution
- Generate all possible lattices of a given
determinant.
- Avoid duplicates: each lattice is uniquely
determined by its Hermite form (triangular
matrix).
- (Remark: not clear we could do the same for
non-equivalent mappings without reasoning with
the corresponding lattices.)
- Check that the lattice is admissible for K,
either by ILP, or by enumeration if integer
points in K can be enumerated.
- For the DCT example:
- in 4D, optimal 112, there are 86,416,644
lattices to check, it takes roughly 2 days!
- rewritten in 3D, optimal 112, there are 941,901
lattices to check, it takes roughly 30 minutes.
- → Feasible only for small sets K and small
dimensions.
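In 2D the exhaustive search is short enough to sketch. The code enumerates integer lattices by increasing determinant through their Hermite form, here taken as a basis (a,0), (b,c) with a·c = det and 0 ≤ b < a. Admissibility is tested by enumerating the points of K. K is assumed to be the parallelogram of the toy example, with a half-plane description derived by hand; all names are hypothetical.

```python
def nonzero_points_of_K():
    """Nonzero integer points of the toy parallelogram (assumed set DS)."""
    return [(x, y) for x in range(-8, 9) for y in range(-5, 6)
            if (x, y) != (0, 0)
            and abs(4 * x + 9 * y) <= 41 and abs(6 * x - 7 * y) <= 41]

def hermite_lattices(det):
    """One basis ((a, 0), (b, c)) per sublattice of Z^2 of determinant det."""
    for a in range(1, det + 1):
        if det % a == 0:
            c = det // a
            for b in range(a):
                yield (a, 0), (b, c)

def contains(lattice, point):
    """Is point = m*(a, 0) + n*(b, c) for some integers m, n?"""
    (a, _), (b, c) = lattice
    x, y = point
    if y % c:
        return False
    return (x - (y // c) * b) % a == 0

def smallest_admissible(points, max_det):
    """First lattice, by increasing determinant, avoiding all given points."""
    for det in range(1, max_det + 1):
        for lattice in hermite_lattices(det):
            if not any(contains(lattice, p) for p in points):
                return det, lattice
    return None

K = nonzero_points_of_K()
print(smallest_admissible(K, 30))
```

The slides report a critical lattice with basis (4,3), (8,0), of determinant 24. The search is bounded below by the volume bound and must stop at 24 at the latest, since that lattice is admissible.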
33Rogers heuristic adapted
- Choose n positive integers ρi such that ρi is a
multiple of ρ(i+1) and dim(Li) ≤ i-1, where Li =
Vect(K/ρi ∩ Z^n).
- Choose a basis (a1, ..., an) of Z^n s.t. Li ⊆
Vect(a1, ..., ai-1).
- Define Λ the lattice generated by the vectors ρi
ai.
- → det(Λ) ≤ n! Vol(K)
34 Heuristic based on K (i.e., lattice)
- Choose n linearly independent integer vectors
(a1, ..., an).
- Compute Fi(ai) = inf{F(y) | y ∈ ai + Vect(a1,
..., ai-1)}.
- Choose n integers ρi such that ρi Fi(ai) > 1.
- Define Λ the lattice generated by the vectors ρi
ai.
- → det(Λ) ≤ (n!)^2 Vol(K) if Fi(ai) ≤ 1 for all i.
35 Heuristic based on K* (i.e., mapping)
- K*: dual (or polar reciprocal) of K, K* = {y | y.x ≤
1 for all x in K}.
- K** = K, F* related to F, Vol(K*) related to
Vol(K), successive minima related, etc.
- Choose n linearly independent integer vectors
(c1, ..., cn).
- Compute Fi*(ci) = sup{ci.x | x in K, c1.x = ... =
ci-1.x = 0}.
- Choose n integers ρi such that ρi > Fi*(ci).
- Define the mapping (M,b) with the ci as rows of M
and b = ρ.
- → det(Λ) ≤ (n!)^2 Vol(K) if Fi*(ci) ≥ 1 for all
i.
- → Dual of the previous heuristic. Exactly
Lefebvre-Feautrier in a well-chosen basis.
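The effect of the basis choice in this mechanism can be sketched on the parallelogram of the toy example. As a simplification, the sketch takes maxima over the integer points of K instead of the rational supremum over K; the two give the same moduli on this example. The half-plane description of K was derived by hand and the names are hypothetical.

```python
def points_of_K():
    """Integer points of the toy parallelogram (assumed set DS)."""
    return [(x, y) for x in range(-8, 9) for y in range(-5, 6)
            if abs(4 * x + 9 * y) <= 41 and abs(6 * x - 7 * y) <= 41]

def dot(c, x):
    return sum(ck * xk for ck, xk in zip(c, x))

def moduli_for_basis(C, points):
    """b_i = 1 + max of c_i.x over points x with c_1.x = ... = c_{i-1}.x = 0."""
    b = []
    for i, c in enumerate(C):
        values = [dot(c, x) for x in points
                  if all(dot(C[k], x) == 0 for k in range(i))]
        b.append(max(values) + 1)
    return tuple(b)

K = points_of_K()
print(moduli_for_basis([(1, 0), (0, 1)], K))    # (9, 5): size 45
print(moduli_for_basis([(1, -1), (0, 1)], K))   # (8, 4): size 32
print(moduli_for_basis([(3, -4), (1, -1)], K))  # (24, 1): size 24
```

With the rows (3,-4), (1,-1), the second modulus is 1, so the mapping reduces to the 1D allocation 3i-4j mod 24 of the toy example.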
36 Important practical factors
- The set DS can be skewed for 3 reasons:
- Skewed iteration sub-domain with respect to full
domain.
- Skewed schedule with respect to iteration domain.
- Skewed access function when reasoning with array
indices.
- → In practice, following the schedule (if it
is expressed as a basis) is not too bad.
- → But ad-hoc counter-examples can be built. And
schedule basis may be hidden in a linearized
schedule.
37Open or On-Going Questions
- How much do we lose if we restrict to 1D
mappings?
- How much do we lose, when restricting to modular
mappings, compared to MAXLIVE?
- Mixing both Lefebvre-Feautrier (successive
modulos) and Quilleré-Rajopadhye (choice of
basis) is often ok (i.e., follow the schedule and
wrap). Can we quickly identify when?
- How costly and how good are the heuristics in
practice?
- How to handle more general cases (union of
polyhedra for conflicting differences, multiple
arrays, etc.)?
- Can this be used as a basis for solving the
general problem (i.e., find the schedule with
minimal memory requirements)?
- Fully implemented in Cl@K; parameters still in
progress.