Title: Processor Architectures and Program Mapping
1Processor Architectures and Program Mapping
Data Memory Management Part a Overview
- 5kk10 TU/e
- Henk Corporaal
- Jef van Meerbergen
- Bart Mesman
2Data Memory Management Overview
- Motivation
- Example application
- DMM steps
- Results
- Notes
- We concentrate on static data structures like arrays
- The Data Transfer and Storage Exploration (DTSE) methodology, on which these slides are based, has been developed at IMEC, Leuven
3Design flow
4The underlying idea
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];
5Platform example TriMedia
6Platform architecture model
[Figure: platform architecture model with memory levels 1 to 4: a chip containing CPUs, ICache, DCache, L2 cache, HW accelerators, local memories, and on-chip busses; a bus interface and bridge connect it over busses to main memory and, via a SCSI bus, to disks]
7Data transfer and storage power
8What about delay of memories?
- Global wiring delay becomes dominant over gate
delay
9Positioning in the Y-chart
[Figure: Y-chart: Applications and an Architecture Instance feed Mapping, followed by Performance Analysis, which yields Performance Numbers]
10Mapping
- Given
  - architecture, e.g. TriMedia TM1000
  - reference C code for application, e.g. MPEG-4 Motion Estimation
- Task
  - map application on architecture
- But wait a moment
  me_at_work> tmcc -o mpeg4_me mpeg4_me.c
  Thank you for running TriMedia compiler.
  Your program uses 257321886 bytes, 78 Watt, and 428798765291 clock cycles
11Let's help the compiler ... DTSE: data transfer and storage exploration
- Transforms C-code of the application
- By focusing on multi-dimensional signals (arrays)
- To better exploit platform capabilities
- This overview covers the major steps to improve
power, area, performance trade-off in the context
of platform based design
12Application example
- Application domain
- Computer Tomography in medical imaging
- Algorithm
- Cavity detection in CT-scans
- Detect dark regions in successive images
- Indicate cavity in brain
Bad news for owner of brain
13Application
[Figure: block diagram of the cavity detection algorithm with the blocks Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Max Value, and Detect Roots]
- Reference (conceptual) C code for the algorithm
  - all functions: image_in[N x M]t-1 -> image_out[N x M]t
  - new value of pixel depends on its neighbors
  - neighbor pixels read from background memory
  - approximately 110 lines of C code (ignoring file I/O etc.)
  - experiments with N x M = 640 x 400 pixels
  - straightforward implementation: 6 image buffers
14DMM (data mem. mgt.) principles
Off-chip SDRAM
Exploit limited life-time
15DMM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse & memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
Address optimization
C-out
16The DM steps
- Preprocessing
  - Rewrite code in 3 layers (parts)
  - Selective inlining, Single Assignment form, ...
- Data flow transformations
  - Eliminate redundant transfers and storage
- Loop and control flow transformations
  - Improve regularity of accesses and data locality
- Data re-use and memory hierarchy layer assignment
  - Determine when to move which data between memories to meet the cycle budget of the application with low cost
  - Determine in which layer to put the arrays (and copies)
17The DM steps
- Per memory layer
  - Cycle budget distribution
    - determine memory access constraints for given cycle budget
  - Memory allocation and assignment
    - which memories to use, and where to put the arrays
  - Data layout
    - determine how to combine and put arrays into memories
- Address optimization on the final C-code
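18Preprocessing: Dividing an application in the 3 layers
[Figure: an application divided into three layers. LAYER1: Module1a, Module1b, Module2, and Module3 with synchronisation (testbench call, dynamic event behaviour, mode selection). LAYER2: the code between Layer 1 and the Layer 3 functions. LAYER3: leaf functions such as int func1(int a, int b), which returns a simple arithmetic expression of a and b]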
19Layered code structure
main() {                                /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}

void cav_detect() {                     /* Layer 2 code */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}
20Layered code structure
void cav_detect() {                     /* Layer 2 code */
  /* Makes code for data access  */
  /* and data transfer explicit  */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}

int foo(int arg1) {                     /* Layer 3 code */
  /* arithmetic, data-dependent operations */
  /* to be mapped to data-path, controller */
}
21Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    gauss_x_image[x][y] = 0;

for (x=1; x<=N-2; x++) {
  for (y=1; y<=M-2; y++) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; k++) {
      gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
    }
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }
}

# accesses to gauss_x_image: N * M + (N-2) * (M-2)
22Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      gauss_x_image[x][y] = 0;
    }

# accesses to gauss_x_image: N * M (gain is almost 50%)
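The "almost 50%" figure can be checked by counting the writes to gauss_x_image in both versions. The sketch below is only an illustration: the function names writes_before and writes_after are hypothetical, and the loop bounds assume the interior runs from 1 to N-2 and from 1 to M-2, as in the code above.

```c
#include <assert.h>

/* Original code: one loop nest clears all N*M pixels, a second nest
   overwrites the (N-2)*(M-2) interior pixels. */
long writes_before(long n, long m) {
    assert(n >= 2 && m >= 2);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = 0; y < m; y++)
            count++;                    /* gauss_x_image[x][y] = 0 */
    for (long x = 1; x <= n - 2; x++)
        for (long y = 1; y <= m - 2; y++)
            count++;                    /* gauss_x_image[x][y] = foo(...) */
    return count;
}

/* Transformed code: a single nest in which each pixel is written exactly
   once, either by the filter branch or by the else branch. */
long writes_after(long n, long m) {
    assert(n >= 2 && m >= 2);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = 0; y < m; y++)
            count++;
    return count;
}
```

For the 640 x 400 image used in the experiments, writes_before(640, 400) gives 509924 and writes_after(640, 400) gives 256000, so about 49.8% of the writes disappear.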
23Data-flow transformation
- In total 5 types of data-flow transformations
  - advanced signal substitution and (copy) propagation
  - algebraic transformations (associativity etc.)
  - shifting delay lines
  - re-computation
  - transformations to eliminate bottlenecks for subsequent loop transformations
24Data-flow transformation - result
25Loop transformations
- Loop transformations
  - improve regularity of accesses
  - improve temporal locality: production <-> consumption
- Expected influence
  - reduce temporary storage and (anticipated) background storage
26Global loop transformation steps applied to
cavity detection
- Make all loop dimensions equal
- Regularize loop traversal: Y and X loop interchange
  - follow order of input stream
- Y loop folding and global merging; X loop folding and global merging
  - full, global scope regularity
  - nearly complete locality for main signals
27Data enters Cavity Detector row-wise
[Figure: serial scan feeding a buffer (image_in) that is read by the GaussBlur loop of the Cavity Detector]
28Loop trafo - cavity detection
[Figure: scanner delivering an N x M image along the X and Y axes; from double buffer to single buffer]
29Loop interchange (Y <-> X)
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    /* filtering code */

=>

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* filtering code */

- Not always possible: check dependences
- For all loops, to maintain regularity
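A minimal sketch of an interchange that is legal because each output pixel depends only on the input pixel at the same position, so no dependence is carried by either loop. The image size, the filter body, and all function names here are hypothetical.

```c
#include <assert.h>

#define N 4   /* hypothetical image width  */
#define M 3   /* hypothetical image height */

/* Placeholder for the filtering code; it reads only its own pixel. */
static int filter(int pixel) { return 2 * pixel + 1; }

/* Original order: x outer, y inner. */
void filter_x_outer(int in[N][M], int out[N][M]) {
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            out[x][y] = filter(in[x][y]);
}

/* Interchanged order: y outer, x inner, following a row-wise input stream. */
void filter_y_outer(int in[N][M], int out[N][M]) {
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            out[x][y] = filter(in[x][y]);
}

/* Returns 1 when both loop orders produce the same image. */
int interchange_preserves_result(void) {
    int in[N][M], a[N][M], b[N][M];
    assert(N > 0 && M > 0);
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            in[x][y] = 10 * x + y;
    filter_x_outer(in, a);
    filter_y_outer(in, b);
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            if (a[x][y] != b[x][y])
                return 0;
    return 1;
}
```

If the filtering code read a value written in an earlier iteration of the same nest, the two orders could differ, which is why the dependence check comes first.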
30Loop trafo - cavity detection
[Figure: repeated fold and loop merge across Gauss Blur x, Gauss Blur y, and Compute Edges; buffer sizes shrink from N x M to N x (2GB+1), and from N x M to N x 3 (offset arrays)]
31Improve regularity and locality -> Loop Merging
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 1st filtering code */
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 2nd filtering code */

=>

for (y=0; y<M; y++) {
  for (x=0; x<N; x++)
    /* 1st filtering code */
  for (x=0; x<N; x++)
    /* 2nd filtering code */
}

- !! Impossible due to dependencies!
32Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    ... gauss_x_image[x][y] = ...

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    for (k=-GB; k<=GB; k++)
      ... = ... gauss_x_image[x][y+k] ...
33Enable merging with Loop Folding (bumping)
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    ... gauss_x_image[x][y] = ...

for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    ... y-GB ...
    for (k=-GB; k<=GB; k++)
      ... gauss_x_image[x][y+k-GB] ...
34Y-loop merging on 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      ... gauss_x_image[x][y] = ...
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        for (k=-GB; k<=GB; k++)
          ... gauss_x_image[x][y-GB+k] ...
      else
        ...
}
35Simplify conditions in merged loop
for (y=0; y<M+GB; y++) {
  for (x=0; x<N; x++)
    if (y<M)
      ... gauss_x_image[x][y] = ...
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      for (k=-GB; k<=GB; k++)
        ... gauss_x_image[x][y-GB+k] ...
    else if (y>=GB)
      ...
}
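The folding and merging steps above can be exercised on a small example. This is a sketch under assumptions: the producer function, the image size, and the plain sum used as the consumer are hypothetical stand-ins for the elided filtering code. The consumer at row y-GB reads rows up to y, which the producer has just written in the same iteration.

```c
#include <assert.h>

#define N  5   /* hypothetical image width  */
#define M  8   /* hypothetical image height */
#define GB 1   /* filter half-width         */

/* Stand-in for the first filter: any function of the position will do. */
static int produce(int x, int y) { return 3 * x + 7 * y + 1; }

/* Stand-in for the second filter: sums rows y-GB .. y+GB of a[][]. */
static int consume(int a[N][M], int x, int y) {
    if (y >= GB && y <= M - 1 - GB) {
        int sum = 0;
        for (int k = -GB; k <= GB; k++)
            sum += a[x][y + k];
        return sum;
    }
    return 0;                           /* border rows */
}

/* Two separate loop nests: the second starts after the first has finished. */
void separate_nests(int a[N][M], int b[N][M]) {
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            a[x][y] = produce(x, y);
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            b[x][y] = consume(a, x, y);
}

/* Folded and merged: the consumer runs GB rows behind the producer. */
void folded_and_merged(int a[N][M], int b[N][M]) {
    for (int y = 0; y < M + GB; y++) {
        if (y < M)
            for (int x = 0; x < N; x++)
                a[x][y] = produce(x, y);
        if (y >= GB)
            for (int x = 0; x < N; x++)
                b[x][y - GB] = consume(a, x, y - GB);
    }
}

/* Returns 1 when both schedules compute the same output image. */
int merged_matches_separate(void) {
    int a1[N][M], b1[N][M], a2[N][M], b2[N][M];
    assert(M > 2 * GB);
    separate_nests(a1, b1);
    folded_and_merged(a2, b2);
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            if (b1[x][y] != b2[x][y])
                return 0;
    return 1;
}
```

In the merged version only the last 2GB+1 rows of the intermediate array are live at any time, which is what later allows the N x M buffer to shrink.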
36Global loop merging/folding steps
- 1: x <-> y loop interchange (done)
- 2: Global y-loop folding/merging: 1st and 2nd nest (done)
- 3: Global y-loop folding/merging: 1st/2nd and 3rd nest
- 4: Global y-loop folding/merging: 1st/2nd/3rd and 4th nest
- 5: Global x-loop folding/merging: 1st and 2nd nest
- 6: Global x-loop folding/merging: 1st/2nd and 3rd nest
- 7: Global x-loop folding/merging: 1st/2nd/3rd and 4th nest
37End result of global loop trafo
for (y=0; y<M+GB+2; ++y) {
  for (x=0; x<N+2; ++x) {
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k] +
          gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] =
        gauss_xy_compute[x][y-GB][(2*GB)+1] / tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
  }
}
38Loop transformations - result
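39Data re-use & memory hierarchy
[Figure: processor data paths with a register file in front of a memory hierarchy; array A is accessed 100 times, and the hierarchy levels see 100, 10, and 1 accesses]
P (original) = # accesses x power/access = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3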
- Introduce memory hierarchy
- reduce number of reads from main memory
- heavily accessed arrays stored in smaller memories
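The power figures on this slide follow from summing accesses times power per access over the layers. A small sketch of that arithmetic, using the relative numbers from the slide; the struct and function names are hypothetical, and a cost of 1 per main-memory access is assumed for the original case.

```c
#include <assert.h>

/* One memory layer: how often it is accessed and the relative cost
   of a single access. */
typedef struct {
    double accesses;
    double power_per_access;
} layer;

/* Total power is the sum of accesses x power/access over all layers. */
double total_power(const layer *layers, int count) {
    assert(count > 0);
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += layers[i].accesses * layers[i].power_per_access;
    return sum;
}

/* Original: all 100 accesses go to the large memory. */
double power_original(void) {
    const layer flat[] = { {100.0, 1.0} };
    return total_power(flat, 1);
}

/* After: 100 accesses to the smallest layer, 10 to the middle layer,
   and a single access to the large memory. */
double power_after(void) {
    const layer hierarchy[] = { {100.0, 0.01}, {10.0, 0.1}, {1.0, 1.0} };
    return total_power(hierarchy, 3);
}
```

power_original() evaluates to 100 and power_after() to 3 (up to floating-point rounding), matching the slide.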
40Data re-use
- Data flow transformations to introduce extra copies of heavily accessed signals
  - Step 1: figure out data re-use possibilities
  - Step 2: calculate possible gain
  - Step 3: decide on data assignment to memory hierarchy
42Data re-use tree
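[Figure: data re-use tree for image_in, gauss_x, gauss_xy/comp_edge, and image_out; between each N*M array and the CPU there are candidate copies of sizes M*3, N*1, 3*3, 1*3, 3*1, and 1*1, annotated with access counts N*M, N*M*3, and N*M*8]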
43Memory hierarchy assignment
[Figure: memory hierarchy assignment of image_in, gauss_x, gauss_xy, comp_edge, and image_out to three layers (1MB SDRAM, 16KB cache, 128 B register file), with copies of sizes N*M, M*3, 3*3, 3*1, and 1*1 and transfer counts N*M, N*M*3, and N*M*8]
44Data-reuse - cavity detection code
Code before reuse transformation
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_lines[x][y] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y] = 0;
    /* Other merged code omitted */
  }
}
45Data-reuse - cavity detection code
Code after reuse transformation
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixels initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<1; ++k)
        in_pixels[(x+k)%3] = image_in[x+k][y];
    /* copy rest of in_pixels in row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3] * Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
}
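The effect of the in_pixels copy can be measured on a single image row. The following is a sketch, not the slides' code: the filter weights, the row length, the load helper that counts reads of the large array, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define N 8                              /* hypothetical row length */

static const int Gauss[2] = {2, 1};      /* hypothetical filter weights */
static int main_memory_reads;            /* reads of the large array */

static int load(const int *row, int x) {
    main_memory_reads++;
    return row[x];
}

/* Before: every tap of the 3-point filter reads the row itself. */
void blur_row_direct(const int *row, int *out) {
    for (int x = 0; x < N; x++) {
        if (x >= 1 && x <= N - 2) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++)
                tmp += load(row, x + k) * Gauss[abs(k)];
            out[x] = tmp;
        } else {
            out[x] = 0;
        }
    }
}

/* After: each pixel is copied once into a 3-entry circular buffer
   and the filter reads only that buffer. */
void blur_row_reuse(const int *row, int *out) {
    int in_pixels[3];
    for (int x = 0; x < N; x++) {
        if (x == 0)
            in_pixels[0] = load(row, 0);
        if (x <= N - 2)
            in_pixels[(x + 1) % 3] = load(row, x + 1);
        if (x >= 1 && x <= N - 2) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++)
                tmp += in_pixels[(x + k) % 3] * Gauss[abs(k)];
            out[x] = tmp;
        } else {
            out[x] = 0;
        }
    }
}

/* Runs both versions on the same row. Returns 1 when the outputs agree
   and reports the number of large-array reads of each version. */
int compare_versions(int *direct_reads, int *reuse_reads) {
    int row[N], a[N], b[N];
    assert(N >= 3);
    for (int x = 0; x < N; x++)
        row[x] = 5 * x + 3;
    main_memory_reads = 0;
    blur_row_direct(row, a);
    *direct_reads = main_memory_reads;
    main_memory_reads = 0;
    blur_row_reuse(row, b);
    *reuse_reads = main_memory_reads;
    for (int x = 0; x < N; x++)
        if (a[x] != b[x])
            return 0;
    return 1;
}
```

With N = 8 the direct version reads the row 3 x (N-2) = 18 times, while the reuse version reads each of the N = 8 pixels once.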
46Data reuse & memory hierarchy
47Data layout optimization
- At this point multi-dimensional arrays are to be assigned to physical memories
- Data layout optimization determines exactly where in each memory an array should be placed, to
  - reduce memory size by in-placing arrays that do not overlap in time (disjoint lifetimes)
  - avoid cache misses due to conflicts
  - exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences
48In-place mapping
[Figure: memory addresses versus time, showing intra-array in-place mapping, inter-array in-place mapping, and both intra + inter combined]
49In-place mapping
- Implements all the anticipated memory size savings obtained in previous steps
- Modifies code to introduce one array per real memory
- Changes indices to addresses in mem. arrays
  b8 A[100][100]; b6 B[20][20];
  for (i,j,k,l)
    B[i][j] = f(B[j][i], A[i+k][j+l]);
50In-place mapping
- Input image is partly consumed by the time first
results for output image are ready
[Figure: occupied index range versus time for Image_in and for Image_out]
51In-place - cavity detection code
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image_out[x-5][y-3] = ...
    /* code removed */
    ... = image_in[x+1][y];
  }
}

=>

for (y=0; y<M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image[x-5][y-3] = ...
    /* code removed */
    ... = image[x+1][y];
  }
}
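The same idea can be shown in one dimension: when results are written a fixed distance behind the read front, input and output can share one array. This sketch uses a hypothetical 3-point sum and a write offset of one element; the array length and function names are also assumptions.

```c
#include <assert.h>

#define LEN 10   /* hypothetical signal length */

/* Reference: separate input and output arrays. The result for position p
   is stored at index p-1, so out[0 .. LEN-3] are valid afterwards. */
void sum3_two_arrays(const int in[LEN], int out[LEN]) {
    for (int p = 1; p <= LEN - 2; p++)
        out[p - 1] = in[p - 1] + in[p] + in[p + 1];
}

/* In-place: element p-1 is read for the last time in iteration p, so its
   location can be reused for the result of that same iteration. */
void sum3_in_place(int image[LEN]) {
    for (int p = 1; p <= LEN - 2; p++)
        image[p - 1] = image[p - 1] + image[p] + image[p + 1];
}

/* Returns 1 when the in-place version yields the same results. */
int in_place_matches(void) {
    int in[LEN], out[LEN], image[LEN];
    assert(LEN >= 3);
    for (int i = 0; i < LEN; i++) {
        in[i] = i * i + 2;
        image[i] = in[i];
    }
    sum3_two_arrays(in, out);
    sum3_in_place(image);
    for (int i = 0; i <= LEN - 3; i++)
        if (out[i] != image[i])
            return 0;
    return 1;
}
```

Writing the result at index p (no offset) would overwrite a value that the next iteration still needs, which is why the offset matters, just as the x-5 and y-3 offsets do in the code above.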
52In-place mapping - results
53The last step ADOPT
(Address OPTimization)
- Increased execution time introduced by DTSE
  - Complicated address arithmetic (modulo!)
  - Additional complex control flow
- Additional transformations needed to
  - Simplify control flow
  - Simplify address arithmetic: common sub-expression elimination, modulo expansion
  - Match remaining expressions on target machine
54ADOPT principles
Example: Full-search Motion Estimation
for (i=-8; i<8; i++)
  for (j=-4; j<=3; j++)
    for (k=-4; k<=3; k++)
      ... A[((208+i)*257+8+j)*257 + 16+i+k] ...
      ... B[(8+j)*257 + 16+i+k] ...
  ... dist += A[3096] - B[((208+i)*257+4)*257 + 16+i-4] ...

=>

cse1 = (33025*i + 6869616)*2
cse3 = 1040*i
cse4 = j*257 + 1032
cse5 = k + cse4
addresses: cse5+cse1, cse5+cse3, 3096, cse1

Algebraic transformations at word-level
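Two of the rewritten addresses can be checked numerically. The sketch below assumes the index expressions read ((208+i)*257+8+j)*257+16+i+k and ((208+i)*257+4)*257+16+i-4; under that reading they equal cse5+cse1 and cse1 respectively for every (i, j, k). The function name and the loop ranges are hypothetical, since the identity does not depend on them.

```c
#include <assert.h>

/* Counts the (i, j, k) combinations for which a shared-subexpression form
   differs from the original address expression. Zero means they agree. */
int count_address_mismatches(void) {
    int mismatches = 0;
    for (int i = -8; i <= 8; i++) {
        /* Depends only on i, so it is computed once per i iteration. */
        int cse1 = (33025 * i + 6869616) * 2;
        if (((208 + i) * 257 + 4) * 257 + 16 + i - 4 != cse1)
            mismatches++;
        for (int j = -4; j <= 3; j++) {
            /* Depends only on j, so it is hoisted out of the k loop. */
            int cse4 = j * 257 + 1032;
            for (int k = -4; k <= 3; k++) {
                int cse5 = k + cse4;
                int original = ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k;
                if (original != cse5 + cse1)
                    mismatches++;
            }
        }
    }
    assert(mismatches >= 0);
    return mismatches;
}
```

The original expression needs two multiplications per innermost iteration; the rewritten form needs one addition for cse5 and one for the final address, with the multiplications moved to the outer loops.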
55Address optimization - result
56Fixing platform parameters
- Assume configurable on-chip memory hierarchy
- Trade-off power versus cycle-budget
[Figure: power (mW, axis from 5 to 25) versus storage cycle budget (axis from 50,000 to 150,000)]
57Conclusion
- Many applications use large (static) data structures
- Access and layout of this data can be heavily optimized
- Compilers don't do this
- Source code (C-to-C) transformations needed !!
- Showed systematic approach
  - Platform independent high-level transformations
  - Platform dependent transformations exploit platform characteristics (optimal use of cache, ...)
- Substantial energy, memory size (cost) and performance improvements
  - MPEG-4, OFDM, H.263, ADSL, ...