Title: Novel Algorithms in the Memory Management of Multi-Dimensional Signal Processing
1. Novel Algorithms in the Memory Management of Multi-Dimensional Signal Processing
- Florin Balasa
- University of Illinois at Chicago
2. Outline
- The importance of memory management in multi-dimensional signal processing
- A lattice-based framework
- The computation of the minimum data memory size
- Optimization of the dynamic energy consumption in a hierarchical memory subsystem
- Mapping multi-dimensional signals into hierarchical memory organizations
- Future research directions
- Conclusions
3. Memory management for signal processing applications
Real-time multi-dimensional signal processing systems (video and image processing, telecommunications, audio and speech coding, medical imaging, etc.)
Key issues: data transfer and data storage
The designer must focus on the exploration of the memory subsystem
Affected design metrics: system performance, power consumption, chip area
4. Memory management for signal processing applications
In the early years of high-level synthesis:
- register-transfer level (RTL) algorithmic specifications
- memory management tasks tackled at the scalar level
More recently:
- high-level algorithmic specifications
- memory management tasks at the non-scalar level
Algebraic techniques (similar to those used in modern compilers)
5. Memory management for signal processing applications
- Affine algorithmic specifications

    T[0] = 0;
    for ( j=16; j<=512; j++ ) {
      S[0][j-16][0] = 0;
      for ( k=0; k<=8; k++ )
        for ( i=j-16; i<=j+16; i++ )
          S[0][j-16][33*k+i-j+17] = S[0][j-16][33*k+i-j+16] + abs( A[4][j] - A[k][i] );
      T[j-15] = S[0][j-16][297] + T[j-16];
    }
    out = T[497];

- Loop-organized algorithmic specification
- Main data structures: multi-dimensional arrays
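6. A Lattice-Based Framework

    for ( i=0; i<=4; i++ )
      for ( j=0; j<=2*i && j<=-i+6; j++ )
        ... A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] ...

[Figure: the iterator space (axes i, j) is mapped to the index space of A[x][y][z], with x = 2i+3j+1, y = 5i+j+3, z = 4i+6j+2.]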
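7. A Lattice-Based Framework

    for ( i=0; i<=4; i++ )
      for ( j=0; j<=2*i && j<=-i+6; j++ )
        ... A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] ...

Affine mapping from the iterator space to the index space:
$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$
Iterator space: $0 \le i \le 4$, $0 \le j \le 2i$, $j \le -i + 6$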
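8. A Lattice-Based Framework
Any array reference can be modeled as a linearly bounded lattice (LBL):
$\mathrm{LBL} = \{\, x = T \cdot i + u \mid A \cdot i \ge b \,\}$
- $x = T \cdot i + u$: affine mapping
- $A \cdot i \ge b$: iterator space, defined by the scope of the nested loops and the iterator-dependent conditions
[Diagram: polytope (iterator space) -> affine mapping -> LBL]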
9. A Lattice-Based Framework

    for ( i=0; i<=4; i++ )
      for ( j=0; j<=2*i && j<=-i+6; j++ )
        ... A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] ...

- How many memory locations are necessary to store the array reference A[2i+3j+1][5i+j+3][4i+6j+2]?
10. A Lattice-Based Framework
The storage requirement of an array reference is the size of its index space (i.e., a lattice!)
$\mathrm{LBL} = \{\, x = T \cdot i + u \mid A \cdot i \ge b \,\}$
$f : \mathbb{Z}^n \to \mathbb{Z}^m$, $f(i) = T \cdot i + u$
Is function f a one-to-one mapping?
If YES: Size(index space) = Size(iterator space)
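The one-to-one question can be checked by brute force on the running example. The sketch below assumes the loop bounds read 0 <= i <= 4, 0 <= j <= 2i, j <= -i+6; the names `iterator_space` and `affine_map` are illustrative, and plain enumeration stands in for the symbolic techniques of the talk.

```python
from itertools import product

# Affine mapping x = T*i + u of the reference A[2i+3j+1][5i+j+3][4i+6j+2].
T = [(2, 3), (5, 1), (4, 6)]
U = (1, 3, 2)


def iterator_space():
    """Integer points (i, j) with 0 <= i <= 4, 0 <= j <= 2i, j <= -i + 6."""
    return [(i, j) for i, j in product(range(5), range(9))
            if j <= 2 * i and j <= -i + 6]


def affine_map(point):
    """Apply x = T*i + u to one iterator vector."""
    return tuple(sum(t * p for t, p in zip(row, point)) + u
                 for row, u in zip(T, U))


iterators = iterator_space()
index_space = {affine_map(p) for p in iterators}

# If no two iterator vectors collide, f is one-to-one on this iterator space
# and the two sizes coincide.
one_to_one = len(index_space) == len(iterators)
print(len(iterators), len(index_space), one_to_one)
```

Here the two columns of T are linearly independent, so the mapping cannot send two distinct iterator vectors to the same array element.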
11. A Lattice-Based Framework
Computation of the size of an integer polytope

    for ( i=0; i<=4; i++ )
      for ( j=0; j<=2*i && j<=-i+6; j++ )
        ... A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] ...

Step 1
Find the vertices of the iterator space and their supporting polyhedral cones
$C(V_1) = \{ r_1 , r_2 \} = \begin{bmatrix} 1 & 1 \\ 2 & 0 \end{bmatrix}$
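Step 1 can be sketched for the 2-D iterator space of the example as follows. This is a minimal illustration, assuming the constraints i >= 0, i <= 4, j >= 0, j <= 2i, j <= -i+6; the function names are hypothetical, and sorting the vertices by angle only works for bounded 2-D polygons.

```python
from fractions import Fraction
from itertools import combinations
from math import atan2, gcd

# Each constraint (a1, a2, b) stands for a1*i + a2*j >= b.
CONSTRAINTS = [(1, 0, 0), (-1, 0, -4), (0, 1, 0), (2, -1, 0), (-1, -1, -6)]


def feasible(p):
    return all(a1 * p[0] + a2 * p[1] >= b for a1, a2, b in CONSTRAINTS)


def vertices():
    """Intersect every pair of constraint lines and keep the feasible points."""
    found = set()
    for (a1, a2, b), (c1, c2, d) in combinations(CONSTRAINTS, 2):
        det = a1 * c2 - a2 * c1
        if det == 0:
            continue  # parallel lines
        p = (Fraction(b * c2 - a2 * d, det), Fraction(a1 * d - b * c1, det))
        if feasible(p):
            found.add(p)
    cx = sum(p[0] for p in found) / len(found)
    cy = sum(p[1] for p in found) / len(found)
    # Counter-clockwise order around the centroid.
    return sorted(found, key=lambda p: atan2(float(p[1] - cy), float(p[0] - cx)))


def primitive(vec):
    """Shortest integer vector in the direction of vec."""
    scale = vec[0].denominator * vec[1].denominator
    ints = [int(c * scale) for c in vec]
    g = gcd(ints[0], ints[1])
    return (ints[0] // g, ints[1] // g)


def supporting_cones():
    """Map each vertex to the two primitive rays along its incident edges."""
    vs = vertices()
    cones = {}
    for k, v in enumerate(vs):
        neighbours = (vs[k - 1], vs[(k + 1) % len(vs)])
        cones[v] = [primitive((n[0] - v[0], n[1] - v[1])) for n in neighbours]
    return cones


print(supporting_cones())
```

For the vertex V1 = (0, 0) this gives the rays (1, 2) and (1, 0), i.e. the columns of the matrix C(V1) on the slide.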
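12. A Lattice-Based Framework
Computation of the size of an integer polytope (cont'd)
Step 2
Decompose the supporting cones into unimodular cones (Barvinok's decomposition algorithm)
$C(V_1)$ is decomposed into the unimodular cones $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and $\begin{bmatrix} 1 & 0 \\ 2 & -1 \end{bmatrix}$
Step 3
Find the generating function of each supporting cone
$F(V_1) = \dfrac{1}{(1 - x y^2)(1 - y^{-1})} + \dfrac{1}{(1 - y)(1 - x)}$
Step 4
Find the number of monomials in the generating function of the whole polytope: $F = F(V_1) + F(V_2) + \cdots$ (summed over the vertices)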
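A quick numerical sanity check of Step 3: the rational function F(V1) = 1/((1 - x y^2)(1 - y^-1)) + 1/((1 - y)(1 - x)) should agree with the sum of x^i y^j over the lattice points of the cone at V1 = (0, 0), which is spanned by the rays (1, 0) and (1, 2), wherever that series converges. The sketch assumes this reading of the slide's formula; the evaluation point (0.1, 0.2) and the truncation length are arbitrary choices.

```python
def cone_series(x, y, terms=60):
    """Truncated sum of x**i * y**j over the cone points i >= 0, 0 <= j <= 2i."""
    return sum(x**i * y**j for i in range(terms) for j in range(2 * i + 1))


def f_v1(x, y):
    """Sum of the generating functions of the two unimodular cones."""
    return 1 / ((1 - x * y**2) * (1 - 1 / y)) + 1 / ((1 - y) * (1 - x))


def closed_form(x, y):
    """Same function over a common denominator: (1 + xy) / ((1 - x)(1 - x y^2))."""
    return (1 + x * y) / ((1 - x) * (1 - x * y**2))


x, y = 0.1, 0.2
print(cone_series(x, y), f_v1(x, y), closed_form(x, y))
```

The numerator 1 + xy of the closed form corresponds to the two lattice points (0, 0) and (1, 1) of the fundamental parallelepiped of the non-unimodular cone.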
13. The Memory Size Computation Problem
- Affine algorithmic specifications

    T[0] = 0;
    for ( j=16; j<=512; j++ ) {
      S[0][j-16][0] = 0;
      for ( k=0; k<=8; k++ )
        for ( i=j-16; i<=j+16; i++ )
          S[0][j-16][33*k+i-j+17] = S[0][j-16][33*k+i-j+16] + abs( A[4][j] - A[k][i] );
      T[j-15] = S[0][j-16][297] + T[j-16];
    }
    out = T[497];

What is the minimum data storage necessary to execute an algorithm (affine specification)?
- Any scalar signal must be stored only during its lifetime
- Signals having disjoint lifetimes can share the same location
14. The Memory Size Computation Problem
- Affine algorithmic specifications

    T[0] = 0;
    for ( j=16; j<=512; j++ ) {
      S[0][j-16][0] = 0;
      for ( k=0; k<=8; k++ )
        for ( i=j-16; i<=j+16; i++ )
          S[0][j-16][33*k+i-j+17] = S[0][j-16][33*k+i-j+16] + abs( A[4][j] - A[k][i] );
      T[j-15] = S[0][j-16][297] + T[j-16];
    }
    out = T[497];

All the previous works proposed estimation techniques!
The number of scalars (array elements): 153,366
The minimum data storage: 4,763
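The two figures can be reproduced by a brute-force lifetime simulation of the specification. This is only a sketch, not the lattice-based algorithm of the talk, and it rests on stated assumptions: the loop bounds are j = 16..512, k = 0..8, i = j-16..j+16; the input array A is alive from the start; a location freed by the last read of an operand can be reused by the result of the same statement; `out` stays alive once written. The function names are illustrative.

```python
def motion_detection_statements():
    """Yield (written scalar, [read scalars]) for every executed assignment."""
    yield ("T", 0), []
    for j in range(16, 513):
        yield ("S", j - 16, 0), []
        for k in range(9):
            for i in range(j - 16, j + 17):
                col = 33 * k + i - j
                yield (("S", j - 16, col + 17),
                       [("S", j - 16, col + 16), ("A", 4, j), ("A", k, i)])
        yield ("T", j - 15), [("S", j - 16, 297), ("T", j - 16)]
    yield ("out",), [("T", 497)]


def storage_profile(statements):
    """Return (number of distinct scalars, peak number of live scalars)."""
    statements = list(statements)
    last_read = {}
    for t, (_, reads) in enumerate(statements):
        for scalar in reads:
            last_read[scalar] = t
    written = {target for target, _ in statements}
    inputs = set(last_read) - written        # read but never written: array A
    live = peak = len(inputs)
    for t, (_, reads) in enumerate(statements):
        # Operands read here for the last time release their locations ...
        live -= sum(1 for scalar in set(reads) if last_read[scalar] == t)
        # ... and the (single-assignment) result occupies one location.
        live += 1
        peak = max(peak, live)
    return len(written | inputs), peak


scalars, peak = storage_profile(motion_detection_statements())
print(scalars, peak)
```

Under these assumptions the peak occurs right after S[0][0][0] is written, when the whole input array A (9 x 529 elements), T[0] and one element of S are alive.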
15. The Memory Size Computation Problem

    #define n 6
    for ( j=0; j<n; j++ ) {
      A[j][0] = in0;
      for ( i=0; i<n; i++ )
        A[j][i+1] = A[j][i] + 1;
    }
    for ( i=0; i<n; i++ ) {
      alpha[i] = A[i][n+i];
      for ( j=0; j<n; j++ )
        A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
    }
    for ( j=0; j<n; j++ )
      B[j] = A[j][2*n];
16. The Memory Size Computation Problem
Decompose the LBLs of the array references into disjoint lattices
17. The Memory Size Computation Problem
Decomposition of the array references of signal A (illustrative example)
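The idea of the decomposition can be illustrated by enumeration: group the elements of A by the set of array references that cover them, so that every group is disjoint from the others. The sketch assumes the n = 6 specification above references A through the index expressions listed in `references`; the talk performs this decomposition symbolically on lattices, and the function names here are hypothetical.

```python
from collections import defaultdict

N = 6


def references(n=N):
    """Index sets of the array references of signal A (assumed reading)."""
    r = range(n)
    return {
        "A[j][0]": {(j, 0) for j in r},
        "A[j][i+1]": {(j, i + 1) for j in r for i in r},
        "A[j][i]": {(j, i) for j in r for i in r},
        "A[i][n+i]": {(i, n + i) for i in r},
        "A[j][n+i+1]": {(j, n + i + 1) for i in r for j in r},
        "A[j][n+i]": {(j, n + i) for i in r for j in r},
        "A[j][2n]": {(j, 2 * n) for j in r},
    }


def disjoint_partition(refs):
    """Group array elements by the exact set of references that contain them."""
    signature = defaultdict(set)
    for name, points in refs.items():
        for p in points:
            signature[p].add(name)
    classes = defaultdict(set)
    for p, names in signature.items():
        classes[frozenset(names)].add(p)
    return dict(classes)


classes = disjoint_partition(references())
for names, points in sorted(classes.items(), key=lambda kv: min(kv[1])):
    print(sorted(names), len(points))
```

Every element of A lands in exactly one class, so the lifetimes can then be analyzed per class instead of per scalar.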
18. Memory Size Computation Algorithm
Step 1
For every indexed signal in the algorithmic specification, decompose the array references into disjoint lattices
Step 2
Based on the lattice lifetime analysis, find the memory size at the boundaries between the blocks of code
Step 3
Analyzing the amounts of signals produced and consumed in each block, prune the blocks of code where the maximum storage cannot happen
Step 4
For each of the remaining blocks, compute the maximum memory size
- exploiting the one-to-one mapping property of array references
- computing the maximum iterator vectors of the scalars
19. Memory trace for an SVD updating algorithm
20. Memory trace for a 2-D Gaussian blur filter algorithm
21.

    for ( i = 0; i < 95; i++ )
      for ( j = 0; j < 32; j++ ) {
        if ( i+j > 30 && i+j < 63 ) A[i][j] = ...;
        if ( i+j > 62 && i+j < 95 ) ... = A[i-32][j];
      }

784 locations

    for ( j = 0; j < 32; j++ )
      for ( i = 0; i < 95; i++ ) {
        if ( i+j > 30 && i+j < 63 ) A[i][j] = ...;
        if ( i+j > 62 && i+j < 95 ) ... = A[i-32][j];
      }

32 locations
Study the effect of loop transformations on the data memory
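The effect of the loop interchange can be checked with a small simulation that tracks which elements of A are alive. The sketch assumes strict loop bounds (i < 95, j < 32) and strict comparisons in the conditions, and that each element is produced once and consumed once; `peak_storage` is an illustrative name.

```python
def iteration_order(i_outer):
    """Iteration vectors (i, j) in execution order for the chosen loop nesting."""
    if i_outer:
        return [(i, j) for i in range(95) for j in range(32)]
    return [(i, j) for j in range(32) for i in range(95)]


def peak_storage(i_outer):
    """Maximum number of simultaneously alive elements of A."""
    live = set()
    peak = 0
    for i, j in iteration_order(i_outer):
        if 30 < i + j < 63:
            live.add((i, j))            # A[i][j] is produced
        if 62 < i + j < 95:
            live.discard((i - 32, j))   # A[i-32][j] is consumed (its only read)
        peak = max(peak, len(live))
    return peak


print(peak_storage(i_outer=True), peak_storage(i_outer=False))
```

With j as the outer loop, the 32 elements produced for a given j are all consumed before the next j starts, so one column's worth of storage suffices.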
22. The Memory Size Computation Problem
All the previous works are estimation techniques; they are sometimes very inaccurate
- For the first time, the storage requirements of applications can be computed exactly using formal techniques
The previous works have constraints on the specifications
- This approach works for the entire class of affine specifications
The previous works are illustrated with simple benchmarks (in terms of array elements, array references, lines of code)
- This approach was tested on complex benchmarks, e.g. code with 113 loop nests 3 levels deep, 906 array references, over 900 lines of code, 4 million scalar signals
23. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
- Multi-dimensional arrays are stored off-chip
- Copies of the frequently accessed array parts should be stored on-chip
[Diagram: two-layer model with an off-chip memory and an on-chip scratch-pad memory (SPM); copy candidates are stored in the SPM.]
24. How to select the copy candidates?
Entire arrays: unlikely
Rows/columns: somewhat better
How to find the heavily accessed array parts?
- An array partitioning based on the intensity of memory accesses is needed
- Data reuse model based on lattices
25. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
[Figure labels: A1, A2, A3, A15, A16, A17]
26. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
Decomposition into disjoint lattices
Computation of the exact number of memory accesses per lattice
Lattice A17:
- accesses (A1): 13,569
- accesses (A2): 13,569
- accesses (A3): 13,569
- accesses (A17): 131,625
Total accesses (A17): 172,332
27. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
Array space of signal A
Map of the array space based on average memory accesses
28. 3-D map of the array space based on the exact number of memory accesses
Array space of signal A
29. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
Array space of signal A
Map of the array space based on average memory accesses
30. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
Array space of signal A
Map of the array space based on average memory accesses
31. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
- Dynamic energy is computed based on the number of accesses to each memory layer
- CACTI power model [Reinman 99]
- One or two orders of magnitude between an SPM access and an off-chip access
- Energy per access is SPM size-dependent; it is constant for small SPM sizes (< a few Kbytes)
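A minimal sketch of the two-layer energy accounting follows. The per-access energies (1 and 50, in arbitrary units) and the access counts are placeholder values chosen only to respect the "one or two orders of magnitude" ratio; they are not CACTI figures or results from the talk. The cost of filling the SPM (one off-chip read plus one SPM write per copied element) is an added modeling assumption.

```python
E_SPM = 1        # hypothetical energy per SPM access (arbitrary units)
E_OFFCHIP = 50   # hypothetical energy per off-chip access (arbitrary units)


def energy_without_spm(total_accesses):
    """All accesses go to the off-chip memory."""
    return total_accesses * E_OFFCHIP


def energy_with_copy(total_accesses, accesses_to_copy, copy_size):
    """Accesses to the copy candidate are served by the SPM."""
    fill = copy_size * (E_OFFCHIP + E_SPM)   # bring the copy on-chip once
    on_chip = accesses_to_copy * E_SPM
    off_chip = (total_accesses - accesses_to_copy) * E_OFFCHIP
    return fill + on_chip + off_chip


# Hypothetical workload: 1000 accesses, 800 of them to a 10-element array part.
print(energy_without_spm(1000), energy_with_copy(1000, 800, 10))
```

The saving grows with the number of accesses per copied element, which is why the selection of copy candidates relies on the access counts per lattice.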
32. Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
33. Signal-to-Memory Mapping
[Figure: the array A[index1][index2] is mapped into the physical memory (addresses 0, 1, 2, ...), starting at the base address of signal A and occupying the window size of signal A.]
34. Signal-to-Memory Mapping
Mapping model (can be used in hierarchical memory organizations)
m-dim. array -> m-dim. window ( w1 , ... , wm )
w_i = (max. distance between alive elements having the same index i) + 1
A[index1]...[indexm] is mapped to A[index1 mod w1]...[indexm mod wm]
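The modulo mapping can be sketched as an address computation. The row-major linearization of the window and the `base` parameter are assumptions added for illustration; the slide only specifies that each index is taken modulo its window size.

```python
def window_address(index, window, base=0):
    """Address of A[index1]...[indexm] inside an m-dim. window (w1, ..., wm)."""
    offset = 0
    for idx, w in zip(index, window):
        # Fold each index into the window, then linearize in row-major order.
        offset = offset * w + idx % w
    return base + offset


# Hypothetical element A[7][9] in a (4, 6) window starting at address 100.
print(window_address((7, 9), (4, 6), base=100))
```

All elements of the array are folded into w1 x ... x wm locations, so the mapping is valid as long as no two simultaneously alive elements share the same folded indices.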
35. Bounding window (w1, w2) = (4, 6); storage requirements: 4 x 6 = 24
Iteration (i=7, j=9)
36. Bounding window (w1, w2) = (4, 6); storage requirements: 4 x 6 = 24
Iteration (i=7, j=9)
37. Computation of the Window of a Lattice of Live Signals

    for ( i=0; i<=3; i++ )
      for ( j=0; j<=2; j++ )
        if ( 3*i >= 2*j ) ... A[2*i+3*j][5*i+j] ...

$x = T \cdot i + u$
[Figure: the iterator space (axes i = 0..3, j = 0..2) is mapped to the index space (axes index1, index2) of A[2i+3j][5i+j].]
38. Computation of the Window of a Lattice of Live Signals
A[2i+3j][5i+j]
2-D window ( w1 = 13 , w2 = 18 ), obtained by integer projection of the lattice on the axes
[Figure: index space; index1 ranges over 0..12 (w1 = 13), index2 ranges over 0..17 (w2 = 18).]
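The window (13, 18) can be reproduced by enumerating the lattice and projecting it on each axis. The sketch assumes the loop nest reads 0 <= i <= 3, 0 <= j <= 2 with the condition 3i >= 2j, and it replaces the integer projection of the talk by plain enumeration; the function names are illustrative.

```python
def live_lattice():
    """Index vectors of A[2i+3j][5i+j] over the assumed iterator space."""
    return [(2 * i + 3 * j, 5 * i + j)
            for i in range(4) for j in range(3) if 3 * i >= 2 * j]


def bounding_window(points):
    """Extent of the lattice along each axis: max - min + 1."""
    return tuple(max(axis) - min(axis) + 1 for axis in zip(*points))


print(bounding_window(live_lattice()))
```

The window bounds 13 x 18 = 234 locations, although the lattice itself contains only nine elements, which shows that the bounding window is a safe but not always tight storage bound.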
39. Future work
- Computation of storage requirements for high-throughput applications, where the code contains explicit parallelism
- Improve the algorithm that aims to optimize the dynamic energy consumption, extending it to an arbitrary number of memory layers
- Extend the hierarchical memory allocation model to save leakage energy
- Use area models for memories in order to trade off the decrease of energy consumption against the increase of area implied by the memory fragmentation
40. Future work
- Memory management for configurable architectures
Several FPGAs contain distributed RAM modules
Homogeneous architectures: RAMs of the same capacity, evenly distributed (Xilinx Virtex II Pro)
Heterogeneous architectures: a variety of RAMs (Altera Stratix II)
- Memory management for dynamically reconfigurable systems
41. Conclusions
- A general framework based on lattices for addressing several memory management problems
Unique features of this research
- The exact computation of the data storage requirement of an application [IEEE TVLSI 2007]
- A formal data reuse model based on partitioning the arrays according to the intensity of memory accesses [ICCAD 2006]
- A signal-to-memory mapping model which works for hierarchical memory organizations [DATE 2007]
42Conclusions
design of a (hierarchical) memory
subsystem optimized for power consumption and
chip area, s.t. performance constraints, starting
from the specification of a (multi-dimensional)
signal processing application
- This topic is considered by the Semiconductor
Research - Corporation (SRC) one of the top synthesis
problems - still unsolved
- This research is interdisciplinary EECSMath
- There is interest for international
co-operation - (potential funding the NSF-PIRE program)
43. Conclusions
Graduate students
Hongwei Zhu (Ph.D. defense Spring 2007)
Ilie I. Luican (Ph.D. defense Spring 2009)
Karthik Chandramouli (M.S. completed)
The End