Modèles et outils mathématiques pour la compilation (Models and Mathematical Tools for Compilation)

1
Lattice-Based Memory Allocation
Alain Darte, Compsys Project (Compilation and Embedded Systems), CNRS, LIP, ENS-Lyon, France.
Joint work with Rob Schreiber (HP Labs) and Gilles Villard (CNRS, LIP).
References: CASES'03, IEEE Transactions on Computers (to appear).
WOG'04, April 25th, 2004. Recent Trends in Compiler Construction. Sven Verdoolaege's PhD defense.
2
Outline

Introduction
  The initial context: PICO, the HP Labs software tool for compiling high-level programs (e.g., C code) into NPAs (non-programmable accelerators). How to store intermediate results?
  Mathematical tools for high-level program transformations.
  An example of communicating pipelined loops.
Lattice-based memory allocation.
Examples of previous work limitations.
Main results and open questions.

3
PiCo (Program In Chip Out)
HP Labs: automatic generation of non-programmable accelerators (NPAs).

Similar tools: MMAlpha (Inria), Atomium (IMEC), Compaan (Leiden).
Other possible inputs: recurrence equations, Matlab, Kahn processes.
4
High-Level Program Optimizations

Program analysis: dependence analysis, lifetime analysis, footprint analysis, array expansion, array renaming, etc.
Code and loop transformations: tiling, scheduling, nested loop transformations, modulo scheduling, etc.
⇒ Well-established mathematical tools and theory: graph algorithms, polyhedral manipulations, Hermite/Smith forms, integer linear programming, Ehrhart polynomials, etc.
BUT
Memory optimizations:
  optimization of local memory (intra-loop buffers),
  optimization of inter-loop buffers for communicating NPAs.
⇒ No suitable mathematical tools so far.

5
Example: DCT-like code.
First NPA:
  do br = 0, 63
    do bc = 0, 63
      do r = 0, 7
        A(br, bc, r, *) = ...
      enddo
    enddo
  enddo
pipelined with
Second NPA:
  do br = 0, 63
    do bc = 0, 63
      do c = 0, 7
        ... = A(br, bc, *, c)
      enddo
    enddo
  enddo
Memory for A

How to schedule the computations?
How to allocate elements of A in local memory so as to reduce its size?
a) Full array: 256K elements. b) Optimized size: 112 elements (< 2 blocks).

A(br, bc, r, c) mapped to (r mod 4, 16(br+bc) + 2r + c mod 28)
6
Outline

Introduction.
Lattice-based memory allocation:
  Definition of modular allocations.
  Conflicting indices and critical lattices.
Examples of limitations of previous work.
Main results and open questions.

7
Memory Reduction Problem for Arrays
Given a scheduled program (i.e., operations are not reordered), or several communicating programs, find the minimal memory size to store intermediate values and an adequate memory mapping.

Lifetime analysis
  Schedule of computations ⇒ lifetime for each value (similar to dependence analysis, exact or over-approximated).
Memory reuse
  Values that are simultaneously live must not share the same location (constraints similar to register allocation).
Restrict to simple addressing functions (for code generation):
  canonical linearization,
  linear mapping in multi-dimensional arrays,
  wrapping with modulo operations (reuse).
⇒ All are special cases of modular memory allocations.

8
Modular Mappings

Generalization of (rotating) registers to higher dimensions:
The value indexed by i is written at the multi-dimensional position Mi mod b, where b is a positive integral vector and M an integral matrix.
Ex: i = (i1, i2) stored at address (2 i1 + i2 mod 3, i1 + i2 mod 6) ⇒ b = (3, 6), size 18.

Given a schedule and a lifetime analysis, find a valid allocation (M, b) such that the product of the components of b (the memory size) is minimized.
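
To make the definition concrete, here is a minimal Python sketch of a modular mapping, using the example above (M = [[2, 1], [1, 1]], b = (3, 6)). The helper names `address`, `memory_size` and `is_valid` are illustrative only, and `is_valid` assumes the conflicts are given as a finite list of index differences.

```python
from math import prod

def address(M, b, i):
    """Memory cell of index vector i under the modular mapping (M, b): M.i mod b."""
    return tuple(sum(m * x for m, x in zip(row, i)) % bk for row, bk in zip(M, b))

def memory_size(b):
    """Number of cells used: product of the components of b."""
    return prod(b)

def is_valid(M, b, differences):
    """Valid iff no nonzero conflicting difference is mapped to the zero address."""
    zero = (0,) * len(b)
    return all(address(M, b, d) != zero for d in differences if any(d))

M, b = [[2, 1], [1, 1]], (3, 6)   # (2*i1 + i2 mod 3, i1 + i2 mod 6)
print(address(M, b, (4, 5)))      # (1, 3)
print(memory_size(b))             # 18
```

Because the mapping is linear before the modulo, two indices collide exactly when their difference is mapped to zero, which is why the validity test only needs the differences.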

Generalizes all previous approaches:
  De Greef, Catthoor, De Man (1996-1997): linearizations + 1 modulo.
  Lefebvre, Feautrier (1996-1997): successive modulos.
  Wilde, Rajopadhye (1996), Quilleré, Rajopadhye (2000): projections.
  Strout, Carter, Ferrante, Simon (ASPLOS'98): only 1 modulo.
  Thies, Vivien, Sheldon, Amarasinghe (PLDI'01): same.

9
Our Main Contributions
Thies et al., PLDI'01: There is a need for a technique able to consider more general storage mappings and that would allow variations in the number of array dimensions, while still capturing the directional and modular reuse of the occupancy vector.

We identify the fundamental object to work with:
  the set S of all differences of conflicting indices.
We show the link with critical lattices:
  finding the best allocation Mi mod b among ALL possible modular allocations amounts to finding the critical integer lattice for the set S.
We give guaranteed heuristics to approximate the optimal.
⇒ It explains previous work.
⇒ It gives new (and better) solutions.
⇒ It shows the link with theoretical work on successive minima, basis reduction, Minkowski's theorems, etc.

10
Outline

Introduction.
Lattice-based memory allocation.
Examples of previous work limitations:
  rely on particular linearizations,
  or may wrap along the wrong axis.
Main results and open questions.

11
De Greef, Catthoor, and De Man

Were the first to identify the need for memory reduction techniques for embedded multimedia applications ⇒ patent (1996) for intra- and inter-array memory reuse.
Inter-array reuse
  Geometrical heuristics for packing different arrays in a given memory buffer ⇒ will not be discussed here.
Intra-array memory reuse
  Consider each original d-dimensional array and its $2^d d!$ canonical linearizations. (Example in 2D: for an N x M array, look at 8 linearizations: Mi+j, Mi-j, -Mi+j, -Mi-j, i+Nj, i-Nj, -i+Nj, -i-Nj.)
  Compute the maximal address difference D between two simultaneously live values.
  Select the linearization with the smallest distance D and wrap the array modulo (D+1).
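
The selection procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the conflict differences used in the demo are an assumed model of the loop `a(i,j) = ...; b(i,j) = a(i-1,j)` (values in the same row, or in consecutive rows with a non-increasing column), not the output of a real lifetime analysis.

```python
from itertools import permutations, product

def canonical_linearizations(sizes):
    """All 2^d * d! canonical linearizations of an array with the given extents,
    each returned as a coefficient vector c (address = c . index)."""
    d = len(sizes)
    result = []
    for order in permutations(range(d)):        # order[0] varies fastest
        strides, stride = [0] * d, 1
        for dim in order:
            strides[dim] = stride
            stride *= sizes[dim]
        for signs in product((1, -1), repeat=d):
            result.append(tuple(s * c for s, c in zip(signs, strides)))
    return result

def best_linearization(sizes, differences):
    """Pick the linearization with the smallest maximal address distance D
    between conflicting values; the array is then wrapped modulo D + 1."""
    def max_distance(c):
        return max(abs(sum(ck * dk for ck, dk in zip(c, d))) for d in differences)
    c = min(canonical_linearizations(sizes), key=max_distance)
    return c, max_distance(c) + 1

N = 6
# Assumed conflict differences: (0, e) within a row, (1, -e) across two rows.
diffs = [(0, e) for e in range(1, N)] + [(1, -e) for e in range(N)]
diffs += [(-x, -y) for x, y in diffs]
coeffs, modulus = best_linearization((N, N), diffs)
print(coeffs, modulus)   # (6, 1) 7, i.e. N*i + j wrapped modulo N + 1
```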

12
De Greef, Catthoor, De Man: Example 1
  do i = 1, N
    do j = 1, N
      a(i,j) = ...
      b(i,j) = a(i-1,j)
    enddo
  enddo

  do i = 1, N
    do j = 1, N
      a(Ni+j mod (N+1)) = ...
      b(i,j) = a(Ni+j+1 mod (N+1))
    enddo
  enddo

  do i = 1, N
    do j = 1, N
      a(-i+j mod (N+1)) = ...
      b(i,j) = a(-i+j+1 mod (N+1))
    enddo
  enddo
Column-major order (Fortran-like): i+Nj, maximal distance N(N-1)+1.
Row-major order (C-like): Ni+j, maximal distance N.
⇒ Best canonical linearization: Ni+j mod (N+1).
13
De Greef, Catthoor, De Man: Example 2
How could we have missed this?
  do i = 1, N
    do j = 1, N
      a(i,j) = ...
      b(i,j) = a(i-1,j)
    enddo
  enddo

  do t = 2, 2N   /* t = i+j */
    do j = max(1,t-N), min(N,t-1)
      a(t-j,j) = ...
      b(i,j) = a(t-j-1,j)
    enddo
  enddo

  do t = 2, 2N   /* t = i+j */
    do j = max(1,t-N), min(N,t-1)
      a(t-j) = ...
      b(i,j) = a(t-j-1)
    enddo
  enddo
Any canonical linearization leads to a distance Θ(N^2)! But the allocation i mod N, or even i, is just fine!
14
Lefebvre and Feautrier

Developed in the context of parallelizing compilers:
  a) Eliminate spurious memory dependences thanks to single assignment form.
  b) Wrap memory back when possible.
Inter-array reuse
  Coloring heuristics on array names (as for register allocation).
Intra-array memory reuse
  Idea 1: forget about original arrays, focus on original loop indices.
  Idea 2: wrap successively in each dimension with modulos.
⇒ From a computational point of view, use classical techniques based on (rational) linear programming.
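
As a rough illustration of the successive-modulo idea, the Python sketch below replaces the linear programming step by brute-force enumeration on a small N. The conflict model (each iteration writes a(i,j), then reads a(i-1,j); a value is live from its write to its last read) and the function names are assumptions of this sketch.

```python
from itertools import combinations

def conflict_differences(order):
    """Differences between indices of simultaneously live values.
    `order` lists the iterations (i, j) in execution order; each iteration
    writes a(i, j) and then reads a(i-1, j)."""
    birth, death = {}, {}
    for t, (i, j) in enumerate(order):
        birth[(i, j)] = death[(i, j)] = 2 * t      # write a(i, j)
        if (i - 1, j) in birth:
            death[(i - 1, j)] = 2 * t + 1          # last read of a(i-1, j)
    diffs = set()
    for p, q in combinations(birth, 2):
        if birth[p] <= death[q] and birth[q] <= death[p]:   # lifetimes overlap
            diffs.add((p[0] - q[0], p[1] - q[1]))
            diffs.add((q[0] - p[0], q[1] - p[1]))
    return diffs

def successive_modulos(diffs, dims=2):
    """Dimension k gets modulus 1 + max |d_k| over the differences whose
    earlier components are all zero."""
    moduli, remaining = [], set(diffs)
    for k in range(dims):
        moduli.append(1 + max((abs(d[k]) for d in remaining), default=0))
        remaining = {d for d in remaining if d[k] == 0}
    return moduli

N = 5
row_major = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]
wavefront = [(t - j, j) for t in range(2, 2 * N + 1)
             for j in range(max(1, t - N), min(N, t - 1) + 1)]
print(successive_modulos(conflict_differences(row_major)))   # [2, 5]
print(successive_modulos(conflict_differences(wavefront)))   # [5, 1]
```

With the row-major schedule this gives (i mod 2, j mod N); with the wavefront schedule t = i+j it gives i mod N with no second dimension.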

15
Lefebvre, Feautrier: Example 1 revisited
  do i = 1, N
    do j = 1, N
      a(i,j) = ...
      b(i,j) = a(i-1,j)
    enddo
  enddo

  do i = 1, N
    do j = 1, N
      a(i mod 2, j) = ...
      b(i,j) = a(i-1 mod 2, j)
    enddo
  enddo
Along i, maximal distance 1 ⇒ i mod 2.
Along j (for a fixed i), maximal distance N-1 ⇒ j mod N, i.e., j.
⇒ Selected allocation (i mod 2, j), with a memory size 2N (note: N+1 in the previous solution).
16
Lefebvre, Feautrier: Example 2 revisited
  do i = 1, N
    do j = 1, N
      a(i,j) = ...
      b(i,j) = a(i-1,j)
    enddo
  enddo

  do t = 2, 2N   /* t = i+j */
    do j = max(1,t-N), min(N,t-1)
      a(t-j) = ...
      b(i,j) = a(t-j-1)
    enddo
  enddo
Along i, maximal distance N-1 ⇒ i mod N, i.e., i.
Along j (for a fixed i), maximal distance 0 ⇒ no extra dimension.
⇒ Selected allocation i mod N, i.e., i. (Note: order N^2 in the previous solution.)
17
Lefebvre, Feautrier: Example 3
  do i = 1, N
    do j = 1, N
      a(i,j) = ...
    enddo
  enddo
pipelined 1 clock cycle later with
  do i = 1, N
    do j = 1, N
      b(i,j) = a(i,j) ...
    enddo
  enddo

Along i, maximal distance 1 ⇒ i mod 2.
Along j (for a fixed i), maximal distance 1 ⇒ j mod 2.
Selected allocation (i mod 2, j mod 2) and size 4. OK.

18
Lefebvre, Feautrier: Example 3 (variant)
  do t = 2, 2N   /* t = i+j */
    do j = max(1,t-N), min(N,t-1)
      a(t-j,j) = ...
    enddo
  enddo
pipelined 1 clock cycle later with
  do t = 2, 2N   /* t = i+j */
    do j = max(1,t-N), min(N,t-1)
      b(t-j,j) = a(t-j,j) ...
    enddo
  enddo

Along i, maximal distance N-1 ⇒ i mod N.
Along j (for a fixed i), maximal distance 0 ⇒ j mod 1.
Corresponding memory size: N.
Same if starting with j. FAIL!

19
Outline

Introduction.
Lattice-based memory allocation.
Examples of previous work limitations.
Main results and open questions
  No way to quickly explain all details, even to experts in lattice theory and reduction theory...
  See the CASES'03 proceedings, the research report (http://perso.ens-lyon.fr/alain.darte), or the IEEE TC journal version (to appear).
  But I can try to:
    explain the basic concepts of critical lattices and modular allocations,
    illustrate different mechanisms,
    state results.

20
There was a Need for a Framework for Memory
Reduction Based on Modular Allocations

Lower bounds
  Given a lifetime analysis, can we give a lower bound for the best achievable memory size? What is the best modular memory allocation?
Upper bounds
  Can we find mechanisms leading to allocations whose memory size is not arbitrarily bad compared to the lower bound (guaranteed heuristics)?
Robustness
  We need a framework that can possibly capture parameters, that does not depend on the basis in which the problem is described, etc. ⇒ Geometrical model.
Computability
  We need to make sure the mechanisms are constructive and lead to heuristics (or algorithms) that can be implemented.

21
Set of Conflicting Index Differences

Index description
  Choose an index description for the values that are going to share a given array (the allocation will be linear with respect to these indices). Typically: loop indices, array indices, etc.
Set of conflicting index differences
  Build the set CS of pairs (i,j) of conflicting (i.e., simultaneously live) indices, and the set DS of differences (i-j).
  We want $\forall (i,j) \in CS,\ i \neq j \Rightarrow Mi \bmod b \neq Mj \bmod b$,
  or equivalently $\forall d \in DS,\ d \neq 0 \Rightarrow Md \bmod b \neq 0$,
  or equivalently $(Md \bmod b = 0,\ d \in DS) \Rightarrow d = 0$.
22
Admissible and Critical Lattices

The kernel of (M,b)
  The set $\Lambda = \{ i \mid Mi \bmod b = 0 \}$ is a full-dimensional lattice.
  (M,b) is valid iff $\Lambda \cap DS = \{0\}$, i.e., $\Lambda$ is an admissible lattice for DS.
Conversely
  If A is a basis for $\Lambda$, an admissible integral lattice for DS, compute the Smith form $A = Q_1 S Q_2$ with $Q_1$ and $Q_2$ unimodular, $S = \mathrm{diag}(b)$.
  The mapping (M,b) where M is the inverse of $Q_1$ has kernel $\Lambda$, thus is a valid allocation with memory size $\det(S) = \det(\Lambda)$.
⇒ The modular allocation with smallest memory size corresponds to a critical integer lattice for DS, i.e., an admissible integer lattice for DS with smallest determinant.
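
The converse direction can be checked numerically on a small case. The Python sketch below uses the lattice basis (4,3), (8,0) that appears in the toy example of the following slides. The factorization A = Q1 S Q2 was worked out by hand for this sketch (it is not produced by a Smith-form routine), so the code verifies it before using it.

```python
def matmul(X, Y):
    """Product of two small integer matrices given as lists of rows."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y))) for c in range(len(Y[0]))]
            for r in range(len(X))]

A  = [[4, 8],    # columns of A are the basis vectors (4, 3) and (8, 0)
      [3, 0]]
Q1 = [[4, -1],   # unimodular (determinant -1)
      [3, -1]]
S  = [[1, 0],    # S = diag(b) with b = (1, 24)
      [0, 24]]
Q2 = [[1, 8],    # unimodular (determinant 1)
      [0, 1]]
M  = [[1, -1],   # inverse of Q1
      [3, -4]]
b  = (1, 24)

def address(i, j):
    """Modular mapping (M, b) applied to the index (i, j)."""
    return tuple((row[0] * i + row[1] * j) % m for row, m in zip(M, b))

print(matmul(matmul(Q1, S), Q2) == A)   # True: A = Q1 S Q2
print(address(4, 3), address(8, 0))     # (0, 0) (0, 0): basis vectors are in the kernel
```

The first component is taken modulo 1 and is always 0, so the mapping reduces to the one-dimensional allocation 3i-4j mod 24, with memory size 24 = det(S).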

23
Modular Mappings: Toy Example
Corners: (-1,5), (1,-5), (8,1), (-8,-1)
24
Modular Mappings: Toy Example
Bounding box: (i mod 9, j mod 6) ⇒ size 54
25
Modular Mappings: Toy Example
Successive modulos: (i mod 9, j mod 5) ⇒ size 45
26
Modular Mappings: Toy Example
Skewed bounding box: (i-j mod 8, j mod 6) ⇒ size 48
27
Modular Mappings: Toy Example
Skewed successive modulos: (i-j mod 8, j mod 4) ⇒ size 32
28
Modular Mappings: Toy Example
Better allocation: (i-j mod 7, j mod 4) ⇒ size 28
29
Modular Mappings: Toy Example
Critical lattice basis: (4,3), (8,0) ⇒ best allocation (3i-4j mod 24).
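
The sizes quoted for the toy example can be reproduced with a short Python sketch. It assumes that the conflicting differences are exactly the integer points of the convex hull of the four corners, which is the set |5x + y| + |x - 8y| <= 41. The function names are illustrative.

```python
from math import prod

# Nonzero integer points of the parallelogram with the four given corners.
POINTS = [(x, y) for x in range(-8, 9) for y in range(-5, 6)
          if (x, y) != (0, 0) and abs(5 * x + y) + abs(x - 8 * y) <= 41]

def dot(c, p):
    return c[0] * p[0] + c[1] * p[1]

def bounding_box(rows):
    """Modulus for each row c: 1 + max |c.d| over all differences."""
    return [1 + max(abs(dot(c, p)) for p in POINTS) for c in rows]

def successive_modulos(rows):
    """Same, but each row only sees the differences cancelled by the previous rows."""
    moduli, remaining = [], POINTS
    for c in rows:
        moduli.append(1 + max((abs(dot(c, p)) for p in remaining), default=0))
        remaining = [p for p in remaining if dot(c, p) == 0]
    return moduli

def is_valid(rows, b):
    """No nonzero difference may be mapped to the zero address."""
    return all(any(dot(c, p) % m for c, m in zip(rows, b)) for p in POINTS)

std, skew = [(1, 0), (0, 1)], [(1, -1), (0, 1)]
for name, rows, b in [
    ("bounding box",              std,  bounding_box(std)),
    ("successive modulos",        std,  successive_modulos(std)),
    ("skewed bounding box",       skew, bounding_box(skew)),
    ("skewed successive modulos", skew, successive_modulos(skew)),
    ("better allocation",         skew, [7, 4]),
    ("critical lattice",          [(3, -4)], [24]),
]:
    print(name, b, prod(b), is_valid(rows, b))
```

Under this assumption all six allocations are valid, with sizes 54, 45, 48, 32, 28 and 24. Shrinking the last modulus of the "better allocation" to 3 breaks validity, because the difference (3,3) lies in the set.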
30
Results for 0-Symmetric Convex Bodies

We work with a 0-symmetric polytope K such that $DS \subseteq K$ (actually, we assume that the vector spaces generated by the points in K and by the integer points in K are equal ⇒ K is full-dimensional).
Lower bound in terms of volume: $\mathrm{Vol}(K)/2^n$.
Optimal solution found by optimized enumeration + ILP.
Heuristics exist with memory size $\le c_n \mathrm{Vol}(K)$, where $c_n$ depends on the dimension n only ⇒ guaranteed heuristics.
One heuristic uses exactly the Lefebvre-Feautrier mechanism, but in a well-chosen basis. It is always equivalent (i.e., with the same memory size) to a particular linearization (= 1D mapping).
Another heuristic (Rogers' principle) works even for arbitrary sets, but the equivalent linearization is not clear.
In practice: follow the schedule, when possible...

Reference: Gruber and Lekkerkerker, Geometry of Numbers.
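
As a worked instance of the volume lower bound, take the toy example of slides 23 to 29 and assume K is the parallelogram spanned by its four corners (n = 2):

```latex
\mathrm{Vol}(K) = 2\,\left|\det\begin{pmatrix} 8 & 1 \\ 1 & -5 \end{pmatrix}\right|
                = 2 \times 41 = 82,
\qquad
\frac{\mathrm{Vol}(K)}{2^n} = \frac{82}{4} = 20.5 .
```

Any valid modular allocation for this K therefore needs at least 21 memory cells, which is consistent with the best allocation of size 24 reported on slide 29.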
31
Remarks on critical lattices

a) Hard to find the critical lattice, starting from 3D, even for simple bodies.
b) Critical integer lattice ≈ critical lattice for large bodies.
⇒ Hard to find the optimal, heuristics needed.
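Lower bound in terms of volume: $\Delta(K) \ge \mathrm{Vol}(K)/2^n$, where $\Delta(K)$ is the determinant of a critical integer lattice for K (the optimal memory size).
If $S - S \subseteq K$, then all elements in S are mapped to different locations ⇒ $\Delta(K) \ge \mathrm{Card}(S)$.
Minkowski's first theorem: if $\Lambda$ is a lattice and K is 0-symmetric with $\mathrm{Vol}(K) \ge 2^n \det(\Lambda)$, then K contains a nonzero lattice point of $\Lambda$.
Gauge function: $F(x) = \inf\{\lambda > 0 \mid x \in \lambda K\}$ is a distance function.
Successive minima: $\lambda_i(K) = \inf\{\lambda > 0 \mid \dim(\mathrm{Vect}(\lambda K \cap \mathbb{Z}^n)) \ge i\}$.
Minkowski's second theorem:

$\frac{2^n}{n!} \det(\Lambda) \le \lambda_1(K) \cdots \lambda_n(K)\, \mathrm{Vol}(K) \le 2^n \det(\Lambda)$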
32
Looking for the optimal solution

Generate all possible lattices of a given determinant.
  Avoid duplicates: each lattice is uniquely determined by its Hermite form (triangular matrix).
  (Remark: it is not clear we could do the same for non-equivalent mappings without reasoning with the corresponding lattices.)
Check that the lattice is admissible for K, either by ILP, or by enumeration if the integer points in K can be enumerated.
For the DCT example:
  in 4D, optimal 112, there are 86,416,644 lattices to check; it takes roughly 2 days!
  rewritten in 3D, optimal 112, there are 941,901 lattices to check; it takes roughly 30 minutes.
⇒ Feasible only for small sets K and small dimensions.
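
On the 2D toy example the exhaustive search is cheap enough to sketch in plain Python. This is a naive brute force, not the optimized enumeration mentioned above. It assumes K is the parallelogram |5x + y| + |x - 8y| <= 41 of the toy slides, and it writes each 2D Hermite form as the basis (a, 0), (c, d) with 0 <= c < a.

```python
# Nonzero integer points of K: an admissible lattice must avoid all of them.
POINTS = [(x, y) for x in range(-8, 9) for y in range(-5, 6)
          if (x, y) != (0, 0) and abs(5 * x + y) + abs(x - 8 * y) <= 41]

def divisors(n):
    return [a for a in range(1, n + 1) if n % a == 0]

def admissible(a, c, d):
    """Is the lattice generated by (a, 0) and (c, d) free of nonzero points of K?"""
    return not any(y % d == 0 and (x - c * (y // d)) % a == 0 for x, y in POINTS)

def critical_integer_lattice(max_det):
    """Scan determinants in increasing order; for each one, enumerate the
    Hermite forms [(a, 0), (c, d)] with a * d = det and 0 <= c < a."""
    for det in range(1, max_det + 1):
        for a in divisors(det):
            d = det // a
            for c in range(a):
                if admissible(a, c, d):
                    return det, [(a, 0), (c, d)]
    return None

print(critical_integer_lattice(60))   # (24, [(8, 0), (4, 3)])
```

The search stops at determinant 24 with the basis (8,0), (4,3), matching the critical lattice basis given for the toy example. The number of Hermite forms grows quickly with the determinant and the dimension, in line with the running times quoted for the DCT example.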

33
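Rogers' heuristic adapted

Choose n positive integers $\rho_i$ such that $\rho_i$ is a multiple of $\rho_{i+1}$ and $\dim(L_i) \le i-1$, where $L_i = \mathrm{Vect}(K/\rho_i \cap \mathbb{Z}^n)$.
Choose a basis $(a_1, \ldots, a_n)$ of $\mathbb{Z}^n$ such that $L_i \subseteq \mathrm{Vect}(a_1, \ldots, a_{i-1})$.
Define $\Lambda$, the lattice generated by the vectors $\rho_i a_i$.
⇒ $\det(\Lambda) \le n!\, \mathrm{Vol}(K)$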

34
Heuristic based on K (i.e., lattice)

Choose n linearly independent integer vectors $(a_1, \ldots, a_n)$.
Compute $F_i(a_i) = \inf\{F(y) \mid y \in a_i + \mathrm{Vect}(a_1, \ldots, a_{i-1})\}$.
Choose n integers $\rho_i$ such that $\rho_i F_i(a_i) > 1$.
Define $\Lambda$, the lattice generated by the vectors $\rho_i a_i$.
⇒ $\det(\Lambda) \le (n!)^2\, \mathrm{Vol}(K)$ if $F_i(a_i) \le 1$ for all i

35
Heuristic based on K* (i.e., mapping)

K*: dual (or polar reciprocal) of K, $K^* = \{ y \mid y \cdot x \le 1 \text{ for all } x \in K \}$.
$K^{**} = K$, $F^*$ related to F, $\mathrm{Vol}(K^*)$ related to $\mathrm{Vol}(K)$, successive minima related, etc.
Choose n linearly independent integer vectors $(c_1, \ldots, c_n)$.
Compute $F_i^*(c_i) = \sup\{ c_i \cdot x \mid x \in K,\ c_1 \cdot x = \cdots = c_{i-1} \cdot x = 0 \}$.
Choose n integers $\rho_i$ such that $\rho_i > F_i^*(c_i)$.
Define the mapping (M,b) with the $c_i$ as rows of M and $b = \rho$.
⇒ $\det(\Lambda) \le (n!)^2\, \mathrm{Vol}(K)$ if $F_i^*(c_i) \ge 1$ for all i
⇒ Dual of the previous heuristic. Exactly Lefebvre-Feautrier in a well-chosen basis.

36
Important practical factors

The set DS can be skewed for 3 reasons:
  Skewed iteration sub-domain with respect to the full domain.
  Skewed schedule with respect to the iteration domain.
  Skewed access function when reasoning with array indices.
⇒ In practice, following the schedule (if it is expressed as a basis) is not too bad.
⇒ But ad-hoc counter-examples can be built, and the schedule basis may be hidden in a linearized schedule.

37
Open or On-Going Questions

How much do we lose if we restrict to 1D mappings?
How much do we lose, when restricting to modular mappings, compared to MAXLIVE?
Mixing both Lefebvre-Feautrier (successive modulos) and Quilleré-Rajopadhye (choice of basis) is often OK (i.e., follow the schedule and wrap). Can we quickly identify when?
How costly and how good are the heuristics in practice?
How to handle more general cases (union of polyhedra for conflicting differences, multiple arrays, etc.)?
Can this be used as a basis for solving the general problem (i.e., find the schedule with minimal memory requirements)?
Fully implemented in Cl@K; parameters still in progress.