Processor Architectures and Program Mapping presentation

Title: Processor Architectures and Program Mapping

1
Processor Architectures and Program Mapping
Data Memory Management Part a Overview

5kk10 TU/e
Henk Corporaal
Jef van Meerbergen
Bart Mesman

2
Data Memory Management Overview

Motivation
Example application
DMM steps
Results
Notes
We concentrate on Static Data structures like
arrays
The Data Transfer and Storage Exploration
(DTSE)methodology, on which these slides are
based, has been developed at IMEC, Leuven

3
Design flow
4
The underlying idea
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];
5
Platform example TriMedia
6
Platform architecture model
[Figure: platform memory hierarchy with Level-1 to Level-4. Components shown: chip with CPUs, ICache, DCache, HW accelerators and local memories, L2 cache, main memory, and disks, connected by on-chip busses, a bus interface, a bridge and a SCSI bus.]
7
Data transfer and storage power
8
What about delay of memories?

Global wiring delay becomes dominant over gate
delay

9
Positioning in the Y-chart
[Figure: Y-chart. Applications and an Architecture Instance feed the Mapping step, followed by Performance Analysis, which yields Performance Numbers.]
10
Mapping

Given
architecture, e.g. TriMedia TM1000
reference C code for application, e.g. MPEG-4 Motion Estimation
Task
map application on architecture
But wait a moment
me_at_work> tmcc -o mpeg4_me mpeg4_me.c
Thank you for running TriMedia compiler.
Your program uses 257321886 bytes, 78 Watt, and 428798765291 clock cycles

11
Let's help the compiler ... DTSE: data transfer and storage exploration

Transforms C-code of the application
By focusing on multi-dimensional signals (arrays)
To better exploit platform capabilities
This overview covers the major steps to improve
power, area, performance trade-off in the context
of platform based design

12
Application example

Application domain
Computer Tomography in medical imaging
Algorithm
Cavity detection in CT-scans
Detect dark regions in successive images
Indicate cavity in brain

Bad news for owner of brain
13
Application
Max Value
Compute Edges
Gauss Blur x
Reverse
Detect Roots
Gauss Blur y

Reference (conceptual) C code for the algorithm
all functions: image_in[N x M] at t-1 -> image_out[N x M] at t
new value of pixel depends on its neighbors
neighbor pixels read from background memory
approximately 110 lines of C code (ignoring file I/O etc.)
experiments with N x M = 640 x 400 pixels
straightforward implementation: 6 image buffers

14
DMM (data mem. mgt.) principles
Off-chip SDRAM
Exploit limited life-time
15
DMM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse and memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
Address optimization
C-out
16
The DMM steps

Preprocessing
Rewrite code in 3 layers (parts)
Selective inlining, Single Assignment form, ....
Data flow transformations
Eliminate redundant transfers and storage
Loop and control flow transformations
Improve regularity of accesses and data locality
Data re-use and memory hierarchy layer assignment
Determine when to move which data between
memories to meet the cycle budget of the
application with low cost
Determine in which layer to put the arrays (and
copies)

17
The DMM steps

Per memory layer
Cycle budget distribution
determine memory access constraints for given
cycle budget
Memory allocation and assignment
which memories to use, and where to put the
arrays
Data layout
determine how to combine and put arrays into
memories
Address optimization on the final C-code

18
Preprocessing: dividing an application into the 3 layers
Module1a
LAYER1
Module2
Module3
Module1b
- testbench call
- dynamic event behaviour
Synchronisation
- mode selection
LAYER2
int
func1(int a, int b)
LAYER3

return ab

19
Layered code structure
main() { /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}

void cav_detect() { /* Layer 2 code */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++)
        gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}

20
Layered code structure
void cav_detect() { /* Layer 2 code */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++)
        gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}
/* Makes code for data access */
/* and data transfer explicit */

int foo(int arg1) { /* Layer 3 */
  /* arithmetic, data-dependent operations to be mapped to data-path, controller */
}
21
Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    gauss_x_image[x][y] = 0;

for (x=1; x<=N-2; x++)
  for (y=1; y<=M-2; y++) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; k++)
      gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }

# accesses: N * M + (N-2) * (M-2)
22
Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else
      gauss_x_image[x][y] = 0;

# accesses: N * M; gain is almost 50%
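The "almost 50%" figure can be checked by counting the writes to gauss_x_image in both versions. The sketch below does only the counting; the function names are made up for this illustration, and the filter itself is left out.

```c
#include <assert.h>

/* Writes to gauss_x_image when the array is first cleared by one loop
   nest and then filled by a second one (the version before the trafo). */
long writes_separate(long n, long m)
{
    assert(n >= 2 && m >= 2);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = 0; y < m; y++)
            count++;                 /* gauss_x_image[x][y] = 0 */
    for (long x = 1; x <= n - 2; x++)
        for (long y = 1; y <= m - 2; y++)
            count++;                 /* gauss_x_image[x][y] = foo(...) */
    return count;
}

/* Writes after the trafo: the if/else writes every element exactly once. */
long writes_merged(long n, long m)
{
    assert(n >= 2 && m >= 2);
    long count = 0;
    for (long x = 0; x < n; x++)
        for (long y = 0; y < m; y++)
            count++;                 /* one write, in either branch */
    return count;
}
```

For the 640 x 400 image used in the experiments this gives 509924 writes before and 256000 after, a reduction of about 49.8%.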
23
Data-flow transformation

In total 5 types of data-flow transformations
advanced signal substitution and (copy)
propagation
algebraic transformations (associativity etc.)
shifting delay lines
re-computation
transformations to eliminate bottlenecks for subsequent loop transformations

24
Data-flow transformation - result
25
Loop transformations

Loop transformations
improve regularity of accesses
improve temporal locality: production <-> consumption
Expected influence
reduce temporary storage and (anticipated)
background storage

26
Global loop transformation steps applied to
cavity detection

Make all loop dimensions equal
Regularize loop traversal: Y and X loop interchange
follow order of input stream
Y loop folding and global merging
X loop folding and global merging
full, global scope regularity
nearly complete locality for main signals

27
Data enters Cavity Detector row-wise
serial scan
Buffer
image_in
GaussBlur loop
Cavity Detector
28
Loop trafo - cavity detection
N x M
Scanner
X
Y
From double buffer to single buffer
29
Loop interchange (Y <-> X)
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    /* filtering code */

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* filtering code */

Not always possible: check dependences
For all loops, to maintain regularity
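Why the dependence check matters can be seen on a small example. The sketch below is not taken from the cavity detector: the array size and both loop bodies are assumptions chosen to contrast a pointwise update (interchange is safe) with an update that reads a[x-1][y+1] (interchange changes the result).

```c
#include <string.h>

enum { N = 6, M = 5 };   /* hypothetical small array */

static void init(int a[N][M])
{
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            a[x][y] = x * M + y;
}

/* Pointwise update: no iteration depends on another one. */
static void scale_xy(int a[N][M])
{
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            a[x][y] *= 2;
}

static void scale_yx(int a[N][M])
{
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            a[x][y] *= 2;
}

/* Each element reads a value written one x earlier and one y later. */
static void skew_xy(int a[N][M])
{
    for (int x = 1; x < N; x++)
        for (int y = 0; y < M - 1; y++)
            a[x][y] = a[x - 1][y + 1] + 1;
}

static void skew_yx(int a[N][M])
{
    for (int y = 0; y < M - 1; y++)
        for (int x = 1; x < N; x++)
            a[x][y] = a[x - 1][y + 1] + 1;
}

/* Returns 1 if both loop orders produce the same array. */
int interchange_preserves_scale(void)
{
    int a[N][M], b[N][M];
    init(a); init(b);
    scale_xy(a); scale_yx(b);
    return memcmp(a, b, sizeof a) == 0;
}

int interchange_preserves_skew(void)
{
    int a[N][M], b[N][M];
    init(a); init(b);
    skew_xy(a); skew_yx(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

In the x-outer order the skewed loop reads values produced in the previous x iteration; in the y-outer order column y+1 has not been updated yet, so old values are read and the results differ.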

30
Loop trafo - cavity detection
N x (2GB+1)
N x 3
Compute Edges
Gauss Blur y
Gauss Blur x
Repeated fold and loop merge
3 (offset arrays)
2GB+1
From N x M to N x (3) buffer size
From N x M to N x (2GB+1) buffer size
31
Improve regularity and locality => Loop Merging
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 1st filtering code */
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 2nd filtering code */

for (y=0; y<M; y++) {
  for (x=0; x<N; x++)
    /* 1st filtering code */
  for (x=0; x<N; x++)
    /* 2nd filtering code */
}

!! Impossible due to dependencies!

32
Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    ... gauss_x_image[x][y] ...
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    ... for (k=-GB; k<=GB; k++)
      ... gauss_x_image[x][y+k] ...
33
Enable merging with Loop Folding (bumping)
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    ... gauss_x_image[x][y] ...
for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    ... y-GB ...
    for (k=-GB; k<=GB; k++)
      ... gauss_x_image[x][y+k-GB] ...
34
Y-loop merging on 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      ... gauss_x_image[x][y] ...
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        ... for (k=-GB; k<=GB; k++)
          ... gauss_x_image[x][y-GB+k] ...
      else ...
}
35
Simplify conditions in merged loop
for (y=0; y<M+GB; y++) {
  for (x=0; x<N; x++)
    if (y<M)
      ... gauss_x_image[x][y] ...
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      ... for (k=-GB; k<=GB; k++)
        ... gauss_x_image[x][y-GB+k] ...
    else if (y>=GB) ...
}
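The folding and merging steps can be tried out on a reduced example. In the sketch below, the array sizes, the stand-in producer produce() and the plain sum used as the second filter are assumptions for illustration only. It compares two separate loop nests with a single y loop of M+GB iterations in which the consumer lags the producer by GB rows.

```c
#include <assert.h>
#include <string.h>

enum { N = 8, M = 10, GB = 1 };   /* hypothetical sizes */

/* Stand-in for the first filter; any per-pixel function would do. */
static int produce(int x, int y) { return 3 * x + 7 * y; }

/* Reference: two loop nests, the second reads rows y-GB .. y+GB. */
void two_nests(int out[N][M])
{
    static int tmp[N][M];
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            tmp[x][y] = produce(x, y);
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++) {
            out[x][y] = 0;
            if (y >= GB && y <= M - 1 - GB)
                for (int k = -GB; k <= GB; k++)
                    out[x][y] += tmp[x][y + k];
        }
}

/* Folded and merged: the second nest is shifted by GB iterations so that
   every row it reads has already been produced in the same y loop. */
void merged(int out[N][M])
{
    static int tmp[N][M];
    for (int y = 0; y < M + GB; y++) {
        if (y < M)
            for (int x = 0; x < N; x++)
                tmp[x][y] = produce(x, y);
        if (y >= GB)
            for (int x = 0; x < N; x++) {
                int yc = y - GB;              /* row being consumed */
                out[x][yc] = 0;
                if (yc >= GB && yc <= M - 1 - GB) {
                    assert(yc + GB <= y);     /* newest row read is ready */
                    for (int k = -GB; k <= GB; k++)
                        out[x][yc] += tmp[x][yc + k];
                }
            }
    }
}

/* Returns 1 if both versions compute the same output image. */
int same_result(void)
{
    int a[N][M], b[N][M];
    two_nests(a);
    merged(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

In the merged version only rows y-2GB to y of tmp are alive at any time, which is what later allows the N x M buffer to shrink to N x (2GB+1).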
36
Global loop merging/folding steps

1 x <-> y Loop interchange (done)
2 Global y-loop folding/merging 1st and 2nd nest (done)
3 Global y-loop folding/merging 1st/2nd and 3rd nest
4 Global y-loop folding/merging 1st/2nd/3rd and 4th nest
5 Global x-loop folding/merging 1st and 2nd nest
6 Global x-loop folding/merging 1st/2nd and 3rd nest
7 Global x-loop folding/merging 1st/2nd/3rd and 4th nest

37
End result of global loop trafo
for (y=0; y<M+GB+2; y++) {
  for (x=0; x<N+2; x++) {
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; k++)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1] / tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
  }
}
38
Loop transformations - result
39
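Data re-use & memory hierarchy
[Figure: processor data paths and register file fed through a memory hierarchy; access counts 100, 10 and 1 at the successive layers]
P (original) = # accesses x power/access = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3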

Introduce memory hierarchy
reduce number of reads from main memory
heavily accessed arrays stored in smaller memories
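The power estimate on this slide is a sum of accesses times power per access over the memory layers. A minimal sketch of that arithmetic follows; the function names are illustrative, and the numbers are the relative ones from the slide.

```c
#include <assert.h>

/* Relative power: sum over layers of (accesses x power per access). */
double access_power(const double accesses[], const double power_per_access[],
                    int layers)
{
    assert(layers > 0);
    double total = 0.0;
    for (int i = 0; i < layers; i++)
        total += accesses[i] * power_per_access[i];
    return total;
}

/* All 100 accesses go to the large memory at relative cost 1. */
double power_flat(void)
{
    const double accesses[] = { 100.0 };
    const double cost[]     = { 1.0 };
    return access_power(accesses, cost, 1);
}

/* With a hierarchy: 100 cheap accesses, 10 medium, 1 expensive. */
double power_hierarchy(void)
{
    const double accesses[] = { 100.0, 10.0, 1.0 };
    const double cost[]     = { 0.01,  0.1,  1.0 };
    return access_power(accesses, cost, 3);
}
```

With these numbers the flat organisation costs 100 units and the hierarchy about 3, matching the slide.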

40
Data re-use

Data flow transformations to introduce extra copies of heavily accessed signals
Step 1 figure out data re-use possibilities
Step 2 calculate possible gain
Step 3 decide on data assignment to memory
hierarchy

42
Data re-use tree
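[Figure: data re-use trees for image_in, gauss_x, gauss_xy/comp_edge and image_out. Each tree runs from the full N*M array through candidate copies (sizes such as M*3, N*1, 3*3, 1*3, 3*1, 1*1) down to the CPU, with edges annotated by access counts (N*M, N*M*3, N*M*8).]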
43
Memory hierarchy assignment
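[Figure: assignment of image_in, gauss_x, gauss_xy, comp_edge and image_out and their copies to three layers: 1MB SDRAM (N*M arrays), 16KB Cache (M*3 line buffers) and 128 B RegFile (1*1, 3*1 and 3*3 windows), with edges annotated by access counts (N*M, N*M*3, N*M*8).]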
44
Data-reuse - cavity detection code
Code before reuse transformation
for (y=0; y<M+3; y++) {
  for (x=0; x<N+2; x++) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_compute);
    } else if (x<N && y<M)
      gauss_x_lines[x][y] = 0;
    /* Other merged code omitted */
  }
}
45
Data-reuse - cavity detection code
Code after reuse transformation
for (y=0; y<M+3; y++) {
  for (x=0; x<N+2; x++) {
    /* first in_pixels initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<=1; k++)
        in_pixels[(x+k)%3] = image_in[x+k][y];
    /* copy rest of in_pixels in row */
    if (x>0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += in_pixels[(x+k)%3] * Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
}
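The effect of the in_pixels copy can be measured by counting reads from the background image. The sketch below is a reduced version of the horizontal blur only; the image size, the filter taps, the test pattern and the helper names are assumptions for illustration. It compares the direct version with one that keeps a three-pixel window in a small circular buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { N = 16, M = 12 };                 /* hypothetical image size */
static const int Gauss[2] = { 2, 1 };    /* hypothetical filter taps */
static int image_in[N][M];
static long image_reads;                 /* reads from background memory */

void fill_image(void)
{
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++)
            image_in[x][y] = (x * 31 + y * 17) % 256;
}

static int load(int x, int y)
{
    assert(x >= 0 && x < N && y >= 0 && y < M);
    image_reads++;
    return image_in[x][y];
}

/* Every output pixel reads its three inputs from image_in. */
long blur_direct(int out[N][M])
{
    image_reads = 0;
    for (int y = 1; y <= M - 2; y++)
        for (int x = 1; x <= N - 2; x++) {
            int acc = 0;
            for (int k = -1; k <= 1; k++)
                acc += load(x + k, y) * Gauss[abs(k)];
            out[x][y] = acc;
        }
    return image_reads;
}

/* Each input pixel is read once and reused from in_pixels[]. */
long blur_reuse(int out[N][M])
{
    int in_pixels[3];
    image_reads = 0;
    for (int y = 1; y <= M - 2; y++)
        for (int x = 0; x <= N - 2; x++) {
            if (x == 0)                       /* first in_pixels initialized */
                for (int k = 0; k <= 1; k++)
                    in_pixels[(x + k) % 3] = load(x + k, y);
            else                              /* copy rest of row */
                in_pixels[(x + 1) % 3] = load(x + 1, y);
            if (x >= 1) {
                int acc = 0;
                for (int k = -1; k <= 1; k++)
                    acc += in_pixels[(x + k) % 3] * Gauss[abs(k)];
                out[x][y] = acc;
            }
        }
    return image_reads;
}

/* Returns 1 if two output images are identical. */
int same_image(int a[N][M], int b[N][M])
{
    return memcmp(a, b, sizeof(int) * N * M) == 0;
}
```

For this 16 x 12 example the direct version performs 420 reads from image_in and the reuse version 160, while both produce the same interior pixels.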
46
Data reuse memory hierarchy
47
Data layout optimization

At this point multi-dimensional arrays are to be assigned to physical memories
Data layout optimization determines exactly where in each memory an array should be placed, to
reduce memory size by in-placing arrays that do not overlap in time (disjoint lifetimes)
avoid cache misses due to conflicts
exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences

48
In-place mapping
[Figure: occupied addresses versus time for three cases: intra in-place, inter in-place, and both intra + inter.]
49
In-place mapping

Implements all the anticipated memory size
savings obtained in previous steps
Modifies code to introduce one array per real
memory
Changes indices to addresses in mem. arrays

b8 A[100][100];
b6 B[20][20];
for (i,j,k,l) B[i][j] = f(B[j][i], A[i+k][j+l]);
50
In-place mapping

Input image is partly consumed by the time first
results for output image are ready

[Figure: two plots of index versus time, one for Image_in and one for Image_out.]
51
In-place - cavity detection code
for (y=0; y<M+3; y++)
  for (x=0; x<N+5; x++)
    image_out[x-5][y-3] = /* code removed */ image_in[x+1][y];

for (y=0; y<M+3; y++)
  for (x=0; x<N+5; x++)
    image[x-5][y-3] = /* code removed */ image[x+1][y];
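The idea that output can overwrite input once that input has been consumed can be shown in one dimension. The sketch below is an illustration under assumed names and an assumed 1-2-1 filter, not the cavity detector code: a three-element window keeps the inputs that are still needed, so the result for index i can be stored in the same array that still holds the unread inputs from i+1 onward.

```c
#include <assert.h>

/* Reference: separate input and output buffers. */
void filter_two_buffers(const int *in, int *out, int n)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        int left  = (i > 0)     ? in[i - 1] : 0;
        int right = (i < n - 1) ? in[i + 1] : 0;
        out[i] = left + 2 * in[i] + right;
    }
}

/* In-place: win[j % 3] caches input element j, so image[i] may be
   overwritten with the result as soon as element i+1 has been read. */
void filter_in_place(int *image, int n)
{
    assert(n > 0);
    int win[3] = { 0, 0, 0 };
    win[0] = image[0];
    for (int i = 0; i < n; i++) {
        if (i + 1 < n)
            win[(i + 1) % 3] = image[i + 1];   /* not yet overwritten */
        int left  = (i > 0)     ? win[(i - 1) % 3] : 0;
        int right = (i < n - 1) ? win[(i + 1) % 3] : 0;
        image[i] = left + 2 * win[i % 3] + right;
    }
}
```

Writes always trail reads in the shared array, so a single buffer plus a three-element window replaces the two full buffers.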
52
In-place mapping - results
53
The last step ADOPT
(Address OPTimization)

Increased execution time introduced by DTSE
Complicated address arithmetic (modulo!)
Additional complex control flow
Additional transformations needed to
Simplify control flow
Simplify address arithmetic: common sub-expression elimination, modulo expansion, ...
Match remaining expressions on target machine

54
ADOPT principles
Example: Full-search Motion Estimation
for (i=-8; i<8; i++)
  for (j=-4; j<3; j++)
    for (k=-4; k<3; k++)
      ... A[((208+i)*257+8+j)*257+16+i+k] ...
      ... B[(8+j)*257+16+i+k] ...
  dist = A[3096] - B[((208+i)*257+4)*257+16+i-4];

cse1 = (33025*i+6869616)*2
cse3 = 1040+i
cse4 = j*257+1032
cse5 = k+cse4
A[cse5+cse1], B[cse5+cse3], A[3096] - B[cse1]
Algebraic transformations at word-level
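That the cse expressions compute the same addresses as the original index expressions can be checked by brute force over the loop ranges. In the sketch below the function names are made up and the strict upper loop bounds are an assumption; the address expressions and constants are the ones on the slide.

```c
/* Original address expressions as written in the loop nest. */
long addr_a(long i, long j, long k)
{
    return ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k;
}

long addr_b(long i, long j, long k)
{
    return (8 + j) * 257 + 16 + i + k;
}

long addr_d(long i)
{
    return ((208 + i) * 257 + 4) * 257 + 16 + i - 4;
}

/* Returns 1 if the cse-based addresses match the originals everywhere. */
int cse_matches(void)
{
    for (long i = -8; i < 8; i++) {
        long cse1 = (33025 * i + 6869616) * 2;   /* hoisted to the i loop */
        long cse3 = 1040 + i;
        if (cse1 != addr_d(i))
            return 0;
        for (long j = -4; j < 3; j++) {
            long cse4 = j * 257 + 1032;          /* hoisted to the j loop */
            for (long k = -4; k < 3; k++) {
                long cse5 = k + cse4;            /* one add per iteration */
                if (cse5 + cse1 != addr_a(i, j, k))
                    return 0;
                if (cse5 + cse3 != addr_b(i, j, k))
                    return 0;
            }
        }
    }
    return 1;
}
```

After the transformation the innermost loop needs only three additions per iteration for the two addresses; the multiplications have moved to the outer loops.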
55
Address optimization - result
56
Fixing platform parameters

Assume configurable on-chip memory hierarchy
Trade-off power versus cycle-budget

[Figure: power (mW, axis from 5 to 25) versus storage cycle budget (axis from 50,000 to 150,000).]
57
Conclusion

Many applications use large (static) data
structures
Access and layout of this data can be heavily
optimized
Compilers don't do this
Source code (C-to-C) transformations needed !!
Showed systematic approach
Platform independent high-level transformations
Platform dependent transformations exploit platform characteristics (optimal use of cache, ...)
Substantial energy, memory size (cost) and
performance improvements
MPEG-4, OFDM, H.263, ADSL, ...
