Title: Memory Optimizations in Embedded Systems
1. Memory Optimizations in Embedded Systems
- Preeti Ranjan Panda
- Synopsys, Inc.
- Email: panda_at_synopsys.com
- Embedded Systems Design Workshop
- 2-4 January, 2002, I.I.T. Delhi
2. Why is memory important?
Memory access is just another instruction...
...so why treat memory differently?
3. Rate of Performance Improvement is Different
[Figure: CPU speed vs. memory speed plotted over the years - the gap between them keeps widening]
4. Impact on Processor Pipeline
Clock cycle determined by slowest pipeline stage
5. Memory Hierarchy
- To retain a small clock cycle, we keep a small memory in the pipeline - this leads to a memory hierarchy backed by Main Memory
6. Impact of Memory Architecture Decisions
- Area
  - 50-70% of an ASIC/ASIP may be memory
- Performance
  - 10-90% of system performance may be memory related
- Power
  - 25-40% of system power may be memory related
7. Important Memory Decisions in Embedded Systems
- What is a good memory architecture for an application?
  - Total memory requirement
  - Delay due to memory
  - Power dissipation due to memory access
- Compiler and synthesis tools (exploration tools) should make informed decisions on:
  - Registers and register files
  - Cache parameters
  - Number and size of memory banks
8. Embedded Systems: Path to Implementation
Specification / Program -> HW/Software Partitioning
HW side - Synthesis Flow:
- High Level Synthesis
- RTL Synthesis
- Logic Synthesis
SW side - Compiler Flow:
- Parsing
- Optimizations
- Code Generation
9. Outline
- Memory Modeling in High Level Synthesis
  - Registers and Register Files
  - Modeling SRAM Access
  - Modeling DRAM Access
- Optimizations
  - Data Placement
  - Register Allocation
  - Storing Data in SRAM
  - Data Cache
- Memory Customisation and Exploration
  - Scratch Pad Memory
  - Memory Banking
11. High Level Synthesis
- Under constraints:
  - Total delay
  - Limited resources
12. High Level Synthesis: Scheduling
13. High Level Synthesis: Resource Allocation and Binding
14. Registers in High Level Synthesis
[Figure: dataflow graph with values A, B, C, D held in registers]
Resource constraint: 2 adders
15. Register Access Model
16. Limitation of Registers
- Complex interconnect
- Every register connects to every FU
[Figure: registers R1-R4 fully connected to functional units FU1-FU4]
17. Register Files
[Figure: registers R1-R3 grouped into a register file]
- Modular architecture
- Limited connectivity
- New optimization opportunities
18. Access Model of Register Files
[Figure: register file access model]
20. Motivation for SRAM
- Limitation of the register file:
  - OK for scalar variables
  - NOT OK for array variables
- Need to handle a large address space
- But retain fast access to scalar variables
21. SRAM Access
[Figure: SRAM connected to the data bus]
22. SRAM-based Architecture
[Figure: datapath with address and data connections to SRAM]
Similar to a processor, but predictability is necessary
23. Memory Model in HLS
Multicycle operations
24. Behavioral Templates
1. Define precedence constraints between stages
2. Templates are used directly by the scheduler
[Figure: template with Stage 1, Stage 2, Stage 3]
25. Templates for Memory Access
3-cycle MEMORY READ template
26. Using Memory Templates
- The operation can be scheduled into Cycle 1
- No change to the scheduling algorithm
- Used in Synopsys Behavioral Compiler
28. Motivation for DRAM
- Large arrays stored in off-chip DRAM
- Embedded DRAM technology
29. DRAM-based Architecture
[Figure: register file and on-chip SRAM reached in 1 cycle; off-chip DRAM takes ~10 cycles]
DRAM access times are not constant!
30. Typical DRAM Organization
[Figure: the address bus feeds a row address to the row decoder, which selects a page of the cell array, and a column address to the column decoder, which drives the data bus]
31. Memory Read Operation
[Figure: synthesis model of the read sequence through row and column decode]
32. Reading Multiple Words
Sample behavior FindAverage:
  Av = (b[0] + b[1] + b[2] + b[3]) / 4
Memory read: 7 cycles; add: 1 cycle; shift: 1 cycle
Each read issues a full row address and column address
Schedule length: 7 x 4 = 28 cycles
33. Page Mode Read
Behavior FindAverage:
  Av = (b[0] + b[1] + b[2] + b[3]) / 4
With page mode, the row is decoded once and the four words are read through the column decoder
Schedule length: 14 cycles (50% faster)
34. Modeling Memory Operations
[Figure: timing templates for Row Decode, Col Decode (Read), Col Decode (Write), Precharge, and Setup, each showing its column address and data phases]
35. Memory Write Operation
[Figure: write sequence - row address and row decode, then column address, data, and column decode, then precharge]
36. Read-Modify-Write (R-M-W)
[Figure: R-M-W at address A0 - one row/column decode, modify, write back]
Schedule length: 10 cycles
A separate read and write would cost 14 cycles
37. Page Mode Write
Example behavior:
  for i = 0 to 7: b[i] = 0
[Figure: one row decode, a sequence of page mode writes, then precharge]
38. Page Mode R-M-W
Example behavior:
  for i = 0 to 7: b[i] = b[i] + 1
[Figure: one row decode, a sequence of page mode read-modify-writes, then precharge]
40. Clustering of Scalars
Example: c = a + b - if scalars a and b sit in the same page P, the 2 reads become a single page mode read
Problem: cluster variables into groups of size P (the page size) to maximize page mode operations
Analogous to the Cache Line Clustering Problem
41. Reordering Memory Accesses
a, b in different pages; a[i] = a[i] + b[i]
Naive order - Read a[i], Read b[i], Write a[i]: 21 cycles
Reordering Read b[i] first turns the access to a[i] into an R-M-W: 16 cycles
42. Reordering Memory Accesses
[Figure: DFG reading b[i], c[i], d[i] and writing c[i], d[i], with the candidate R-M-W paths marked]
R-M-W is possible only for non-intersecting paths
43. Hoisting
Example: if (c > 0) y = a[0]; else y = a[1]
The reads of a[0] and a[1] can be hoisted above the branch and scheduled together
44. Loop Transformations 1
45. Loop Transformations 2
Multiple arrays accessed in a loop, with a, b, c in different pages:
  for (i = 0; i < N; i++) { Read a[i]; Read b[i]; Write c[i] }
Unroll, then group accesses to the same array so each group completes in page mode:
  for (i = 0; i < N; i += 2) { Read a[i]; Read a[i+1]; Read b[i]; Read b[i+1]; Write c[i]; Write c[i+1] }
46. Loop Transformations 3
Loops with disjoint sub-graphs in the body:
  split the loop into one loop per sub-graph
Each split loop then touches fewer pages; no page mode is possible without the transformation
47. CDFG Transformation
- Cluster scalars and assign addresses
- For each basic block:
  - Perform R-M-W reordering
- For each conditional:
  - Perform hoisting, if applicable
- For each inner loop L:
  - Perform loop splitting, if applicable
- For each loop L in the new CDFG:
  - Perform loop restructuring/unroll
48. DRAM Optimization Experiments
[Figure: schedule lengths, normalized]
On average, the optimized schedules are 40% faster than fine-grain scheduling
50. Life-time of Variables
Life-time: from the definition to the last use of a variable
51. Conflict Graph of Life-times
[Figure: life-time intervals for variables x, y, z, p, q, r, and the conflict graph in which an edge joins two variables whose life-times overlap]
52. Colouring the Conflict Graph
Minimum number of registers = chromatic number of the conflict graph
53. Colouring determines Register Allocation
[Figure: coloured conflict graph for x, y, z, p, q, r - variables with the same colour share a register]
54. Minimizing Register Count
- Graph colouring is NP-complete
- Heuristics (growing clusters)
- A polynomial-time solution exists for straight-line code (no branches): the left-edge algorithm
- Possible to incorporate other factors
  - Interconnect cost annotated as edge weight
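The left-edge algorithm mentioned above can be sketched in C - a minimal illustration assuming life-times are closed intervals [start, end] and two life-times conflict when they overlap (the interval layout and function names here are ours, not from the slides):

```c
#include <stdlib.h>

typedef struct { int start, end, reg; } Lifetime;

static int by_start(const void *a, const void *b) {
    return ((const Lifetime *)a)->start - ((const Lifetime *)b)->start;
}

/* Left-edge allocation: sort life-times by start ("left edge"), then
 * greedily place each one into the lowest-numbered register whose last
 * assigned life-time has already ended. Returns the register count. */
int left_edge_allocate(Lifetime *v, int n) {
    int nregs = 0;
    int last_end[64];                 /* last use time in each register */
    qsort(v, n, sizeof(Lifetime), by_start);
    for (int i = 0; i < n; i++) {
        int r;
        for (r = 0; r < nregs; r++)
            if (last_end[r] < v[i].start) break;  /* register is free */
        if (r == nregs) nregs++;                  /* need a new register */
        v[i].reg = r;
        last_end[r] = v[i].end;
    }
    return nregs;
}
```

For straight-line code this greedy sweep is optimal, which is why the slides single it out against NP-complete general graph colouring.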
55. Register Files / Multiport Memories
- The scalar approach is infeasible for 100s of registers - interconnect delays dominate
- Need to store variables in register files
- Limited bandwidth
Problem: how to do register allocation to register files efficiently?
56. Which variables go into Multiport Memory?
Problem: given a schedule and a multiport memory, which variables should be stored in the memory?
Schedule: State 1 accesses R1, R2, R3; State 2 accesses R2, R1, R1
Which registers should go into the dual-port memory?
57. ILP Formulation
Maximize x1 + x2 + x3 (maximize the registers stored in memory)
Constraints: x1 + x2 + x3 <= 2 and similar per-state bounds (a dual-port memory serves at most 2 accesses per state)
Solution: x1 = 1, x2 = 1, x3 = 0 (store R1 and R2 in memory)
59. Storing Array Data
- Usually, the storage layout is dictated by the language
- In embedded systems, we can reorder data
  - the entire application should be visible
- New challenges and optimisation opportunities
  - Data storage strategies
  - Memory architecture customisation
60. Storing Multi-dimensional Arrays: Row-major
int X[4][4]
[Figure: logical 4x4 array mapped to physical memory locations 0-15 in row-major order]
61. Storing Multi-dimensional Arrays: Column-major
int X[4][4]
[Figure: logical 4x4 array mapped to physical memory locations 0-15 in column-major order]
62. Storing Multi-dimensional Arrays: Tile-based
int X[4][4]
[Figure: logical 4x4 array mapped to physical memory locations 0-15 tile by tile]
63. Impact of Storage Strategy
- Successive memory references should be local (independent of memory architecture)
- Better data cache performance/energy
- Reduced address bus switching
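The three storage schemes differ only in how a logical index (i, j) maps to a physical offset. A small sketch for the slides' 4x4 array, assuming 2x2 tiles laid out row-major with elements row-major inside each tile (the tile shape and function names are ours):

```c
/* Physical offsets of logical element (i, j) of an R x C array under
 * the three storage schemes. */
enum { R = 4, C = 4, TH = 2, TW = 2 };   /* array and tile dimensions */

int row_major(int i, int j) { return i * C + j; }
int col_major(int i, int j) { return j * R + i; }

int tile_based(int i, int j) {
    int tile   = (i / TH) * (C / TW) + (j / TW);  /* which tile */
    int within = (i % TH) * TW + (j % TW);        /* offset inside tile */
    return tile * (TH * TW) + within;
}
```

Choosing the mapping that keeps successive references close together is exactly the storage-strategy decision the following slides explore.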
64. Array Access Pattern Determines Storage: Row-major
for (i = 0; i < N; i++) for (j = 0; j < N; j++) A[i][j] = 0;
The inner loop walks along a row, matching row-major storage
65. Array Access Pattern Determines Storage: Column-major
for (i = 0; i < N; i++) for (j = 0; j < N; j++) B[j][i] = B[j][i] + 1;
The inner loop walks down a column, matching column-major storage
Note: the effect can also be achieved by loop interchange
66. Array Access Pattern Determines Storage: Tile-based
Simplified kernel of the Successive Over-Relaxation algorithm:
for (i = 1; i < N-1; i++)
  for (j = 1; j < N-1; j++)
    u[i][j] = ... u[i-1][j] + u[i][j-1] + u[i][j+1] + u[i+1][j] ...
[Figure: the 5-point stencil around u[i][j]]
67. Determining Tile Shape
[Figure: execution trace of the stencil accesses]
68. Determining Tile Shape
New elements are accessed in each iteration
Tile = the smallest rectangle enclosing the access pattern
69. Address Switching Analysis
Definitions:
- Minimal transition: small address offset - few address bits switch
- Maximal transition: large address offset - many address bits switch
70. Address Bus Switching: Row-major
3 maximal transitions per iteration
Maximal transitions = number of distinct rows accessed in an iteration
71. Address Bus Switching: Tile-based (1)
Case 1: no maximal transition (the iteration's accesses fall within one tile)
72. Address Bus Switching: Tile-based (2)
Each iteration spans at most 2 tiles
Case 1: no maximal transition
Case 2: no more than 2 maximal transitions per iteration
73. Mapping Strategy
If the outer loop increment >= tile width     /* Case 1: no maximal transition */
   or the tile has 2 or more rows and columns /* Case 2: at most 2 maximal transitions */
  -> Tile-based
Else if the tile spans a single column        /* transitions follow the column */
  -> Column-major
Else
  -> Row-major
75. Array Layout and Data Cache
int a[1024]; int b[1024]; int c[1024];
...
for (i = 0; i < N; i++)
  ... a[i] ... b[i] ... c[i] ...
Data cache: direct-mapped, 512 words
[Figure: a[i], b[i], and c[i] are 1024 words apart in memory, so all three map to the same cache line]
Problem: every access leads to a cache miss
76. Data Alignment
int a[1024]; int b[1024]; int c[1024];
...
for (i = 0; i < N; i++)
  ... a[i] ... b[i] ... c[i] ...
Data cache: direct-mapped, 512 words
[Figure: the arrays are offset in memory so a[i], b[i], c[i] map to different cache lines]
Data alignment avoids cache conflicts
77. Motivating Example
struct x { int a; int b; } p[1000];
int q[1000];
...
avg = 0;
for (i = 0; i < 1000; i++)    /* Loop 1 */
  avg = avg + p[i].a;
avg = avg / 1000;
...
for (i = 0; i < 1000; i++)    /* Loop 2 */
  ... p[i].b ... q[i] ... p[i].b ...
78. Cache Performance: Loop 1
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 77)
[Figure: cache lines 0-3 during Loop 1 - each fetched line holds a p[i].a together with the unused p[i].b]
79. Cache Performance: Loop 2
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 77)
[Figure: cache lines 0-3 during Loop 2 - each fetched line holds only one useful p[i].b, while q[i] has spatial locality]
80. Cache Performance
(code as on Slide 77)
Loop 1: 1000 cache misses
Loop 2: 1500 cache misses (1000 misses for p[i].b + 500 misses for q[i])
Cache miss rate: 62.5%
81. Transformed Data Layout
struct y {
  int q;   /* originally q */
  int b;   /* originally x.b */
} r[1000];
int a[1000];   /* originally x.a */
...
avg = 0;
for (i = 0; i < 1000; i++)    /* Loop 1 */
  avg = avg + a[i];
avg = avg / 1000;
...
for (i = 0; i < 1000; i++)    /* Loop 2 */
  ... r[i].q ... r[i].b ...
(Original declaration: struct x { int a; int b; } p[1000]; int q[1000];)
82. Cache Performance: Loop 1
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 81)
[Figure: cache lines 0-3 during Loop 1 - lines now hold only a[] elements]
No useless data in cache
83. Cache Performance: Loop 2
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 81)
[Figure: cache lines 0-3 during Loop 2 - each line holds the r[i].q and r[i].b used together]
No useless data in cache
84. Cache Performance
(code as on Slide 81)
Loop 1: 500 cache misses
Loop 2: 1000 cache misses
Cache miss rate: 37.5%
85. Data Layout Transformation
- Splitting structs into individual arrays
- Account for pointer arithmetic, dereferencing
- Clustering of arrays
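The splitting transformation of Slides 77-84 can be written out directly in C - a hand-applied sketch of the layout change, with our own helper name and a reduced N for illustration:

```c
/* Before: array of structs - fields a and b interleaved in memory. */
struct x { int a; int b; };

/* After: b stays clustered with q (used together in Loop 2), while the
 * a fields become a standalone array (used alone in Loop 1). */
struct y { int q; int b; };

enum { N = 8 };

/* Copy from the original layout to the transformed one. */
void split_layout(const struct x *p, const int *q,
                  struct y *r, int *a) {
    for (int i = 0; i < N; i++) {
        a[i]   = p[i].a;   /* a fields now contiguous */
        r[i].q = q[i];     /* q and b of the same i share a cache line */
        r[i].b = p[i].b;
    }
}
```

In a real compiler pass the copy is not performed at run time; instead all references are rewritten to the new layout, which is why pointer arithmetic and dereferencing must be accounted for.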
86. Representing Array Accesses
for i = 1 to 100    // Loop L1
  Read A[i]; Read B[i]; Read C[i]
for i = 1 to 2000   // Loop L2
  Read B[i]; Read C[i]
for i = 1 to 500    // Loop L3
  Read A[i]; Read D[i]
[Figure: bipartite graph - loop nodes L1 (100), L2 (2000), L3 (500) connected to the array nodes A, B, C, D they access]
87. Clustering Algorithm
- Start with an empty cluster set
- Sort all loops in decreasing order of array access count
- For each loop:
  - for each unassigned array in the loop:
    - determine the cost of assigning the array to each existing cluster (including a new empty cluster)
    - assign the array to the cluster with least cost
88. Cost Computation
Correlated arrays: array indices are affine and differ by a constant
- Penalty for assigning two correlated arrays to separate clusters
- Penalty for assigning two uncorrelated arrays to the same cluster
89. Experiments on DSP/Image/Scientific Examples
Average reduction in cycle time: 44%
90. Motivating Example: FFT
double sigreal[2048];
...
le = le / 2;
for (i = j; ...) {
  ... sigreal[i] ...
  ... sigreal[i + le] ...
  ...
  sigreal[i] = ...;
  sigreal[i + le] = ...;
}
91. Example: FFT - Padded
double sigreal[2048 + 16];
...
le = le / 2;  le = le + le / 128;
for (i = j; ...) {
  i = i + i / 128;
  ... sigreal[i] ...
  ... sigreal[i + le] ...
  ...
  sigreal[i] = ...;
  sigreal[i + le] = ...;
}
15% speed-up on a Sparc5 due to padding
92. Loop Blocking
Original code:
  for i = 1 to N
    for k = 1 to N
      r = X[i][k]
      for j = 1 to N
        Z[i][j] = Z[i][j] + r * Y[k][j]
Blocked code (B x B blocks):
  for kk = 1 to N by B
    for jj = 1 to N by B
      for i = 1 to N
        for k = kk to min(kk + B - 1, N)
          r = X[i][k]
          for j = jj to min(jj + B - 1, N)
            Z[i][j] = Z[i][j] + r * Y[k][j]
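The blocked code above, made compilable for fixed sizes - the concrete N and B are ours, and the loops use 0-based C indexing:

```c
enum { N = 8, B = 4 };

static int min(int a, int b) { return a < b ? a : b; }

/* Blocked matrix multiply Z += X * Y: the kk/jj loops walk B x B blocks
 * so the touched pieces of Y and Z stay cache-resident across i. */
void matmul_blocked(int X[N][N], int Y[N][N], int Z[N][N]) {
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min(kk + B, N); k++) {
                    int r = X[i][k];
                    for (int j = jj; j < min(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

The result is identical to the unblocked loop nest; only the order of the k and j iterations changes, which is what converts capacity misses on Y and Z into reuse.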
93. Terminology
- Compulsory cache miss: data never before brought into the cache
- Capacity cache miss: cache too small
- Conflict cache miss: competition for cache space
  - Self-interference: within the same tile
  - Cross-interference: across different tiles
[Figure: a tile of an array mapped into the data cache, showing self- and cross-interference]
94. Self-Interference Conflicts
[Figure: a 30 x 30 tile of an array with 256-word rows mapped into a 1024-element direct-mapped cache - the tile's rows fall 256 apart in the cache, so rows four apart collide]
95. Padding Avoids Self-Interference
[Figure: padding each 256-word row by 8 words staggers the 30 tile rows in the 1024-element direct-mapped cache, so they no longer collide]
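The pad can be found by direct search: grow the row length until the tile's rows occupy disjoint cache regions. A simplified sketch for a direct-mapped cache - the search strategy is ours (published tile/pad-selection heuristics are more sophisticated), but it reproduces the pad of 8 for the slide's 30x30 tile:

```c
enum { CACHE = 1024 };   /* direct-mapped cache size in words */

/* Return 1 if the first `rows` rows of an array with `cols` columns map
 * to non-overlapping cache regions, each row contributing `width` words
 * (the tile width). */
static int rows_conflict_free(int cols, int rows, int width) {
    for (int a = 0; a < rows; a++)
        for (int b = a + 1; b < rows; b++) {
            int da = (a * cols) % CACHE;   /* cache position of row a */
            int db = (b * cols) % CACHE;
            int d  = da < db ? db - da : da - db;
            if (d < width || CACHE - d < width)
                return 0;                  /* tile rows overlap in cache */
        }
    return 1;
}

/* Smallest pad so a rows x width tile has no self-interference. */
int find_pad(int cols, int rows, int width) {
    for (int pad = 0; pad < CACHE; pad++)
        if (rows_conflict_free(cols + pad, rows, width))
            return pad;
    return -1;   /* tile too large for this cache */
}
```

With 256-word rows every fourth tile row lands on the same cache location; padding the rows to 264 words staggers all 30 rows at least 32 words apart, eliminating the self-interference.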
96. Multiple Tiled Arrays
[Figure: tiles of the initial arrays X and Y, and the new padded tile of R (R + PAD)]
97. Stability of Cache Performance
Matrix multiplication (array sizes 35-350)
[Figure: cache performance of tile-size-selection schemes TSS, ESS, LRW, and DAT across array sizes]
DAT uses fixed tile dimensions; the others use widely varying sizes
99. Embedded System Synthesis
Specification (e.g. a[i] = b[i] + c[i])
Hw/Sw Partitioning
[Figure: processor core with on-chip instruction memory and on-chip data memory, synthesized HW, and on-chip/off-chip DRAM]
100. Scratch-Pad Memory
- Embedded processor-based system
  - Processor core
  - Embedded memory
    - Instruction and data cache
    - Embedded SRAM (Scratch Pad Memory)
    - Embedded DRAM
- Problem: partition program data into on-chip and off-chip memory
101. Memory Address Space
Addressable memory as seen by the CPU:
- Addresses 0 to P-1: on-chip scratch-pad memory - 1-cycle access
- Addresses P to N-1: off-chip memory - 1 cycle on a data cache hit, 10-20 cycles on a miss
102. Architecture
[Figure: the CPU core issues address and data to the scratch-pad memory and, in parallel, to the data cache; on a miss the external memory interface fetches from DRAM]
103. Illustrative Example - 1
Procedure Histogram_Evaluation:
char BrightnessLevel[512][512];
int Hist[256];
for (i = 0; i < 512; i++)
  for (j = 0; j < 512; j++) {      /* each pixel (i,j) in the image */
    level = BrightnessLevel[i][j];
    Hist[level] = Hist[level] + 1;
  }
Regular access (BrightnessLevel): assign to cache
Irregular access (Hist): assign to SRAM
104. Illustrative Example - 2
int source[128][128], dest[128][128], mask[4][4];
Procedure CONV:
loop i
  loop j
    dest[i][j] = Mult(source[i][j], mask)
mask: small, reused in every iteration - on-chip SRAM
source: large - off-chip, accessed through the cache
[Figure: the mask window over source at iterations (0,0) and (0,1)]
105. Data Partitioning
- Pre-partitioning
  - Scalar variables and constants to SRAM
  - Large arrays to DRAM
- SRAM/Cache partitioning
  - Identify critical data for mapping to SRAM
  - Criteria:
    - Life-times of arrays
    - Access frequency of arrays (IF)
    - Loop conflicts (LCF)
106. Data Partitioning Experiments
Average 30% improvement in memory cycles
107. Reuse Analysis
Which memory references reuse a cache line?
loop i
  loop j
    ... a[i][j], a[i][j+1], a[i][j-1], a[i-1][j], a[i+1][j], b[i], c[j][i] ...
Each reference exhibits group spatial, self spatial, self temporal, or no reuse
- Divide memory references into Reuse Equivalence Classes
- Volume analysis
108. Architecture Exploration
Algorithm MemExplore:
for each on-chip memory size T (in powers of 2)
  for each cache size C (in powers of 2, C < T), with scratch-pad size S = T - C
    for each line size L (in powers of 2, L <= C)
      keep the (C, L) that maximizes performance
109. Variation with Cache Line Size
Example: Histogram; cache size 1 KB
[Figure: memory performance vs. cache line size]
110. Variation with Cache/Scratch-Pad Ratio
Example: Histogram
[Figure: effect of different ratios of SRAM and cache; total on-chip memory size 2 KB]
111. Variation with Total On-chip Memory Space
Example: Histogram
[Figure: variation of memory performance with total on-chip memory]
113. Memory Banking: Motivation
for I = 0 to 1000: A[I] = A[I] + B[I] + C[2I]
[Figure: arrays A, B, C in a single memory bank sharing one page buffer; row address = Addr[15:8], column address = Addr[7:0] - every switch between arrays misses the open page]
114. Memory Banking: Motivation
for I = 0 to 1000: A[I] = A[I] + B[I] + C[2I]
[Figure: three banks, one each for the A[I], B[I], and C[2I] streams - each bank has its own row/column decode and page buffer, so all three streams stay in page mode]
115. Exploration Algorithm
Algorithm Partition(G):
for k = 1 to M:    /* do k-way partitioning */
  1. Generate initial partition P
  2. Generate an n-move sequence into any of the k partitions
  3. Retain the partition Pk with minimum Delay(Pk)
  4. Plot (k, Area(Pk)) and (k, Delay(Pk)) on the exploration graph
end Algorithm
116. Initial Partition
[Figure: arrays A, B, C, D, E in a row; cut lines between them define the 1- to 5-bank partitions - each cut assigns a contiguous cluster of arrays to a bank]
117. Cost Function Computation
- Area(Pk): total memory area
- Delay(Pk): estimated schedule length (list scheduling)
  - Memory access delays are unknown at this stage
    - Page hits/misses unknown
  - Determine the ordering that minimises page misses for the given partition
118. Memory Dependence Graph
[Figure: a DFG over accesses A[I], A[I+1], C[I], D[I], E[I], G[I], H[I], and the Memory Dependence Graph (MDG) derived from it]
119. Partitioned MDG (PMDG)
The MDG is the basis for bank-partitioning exploration
[Figure: MDG accesses to arrays A, C, D, E, G, H partitioned into Bank 1, Bank 2, and Bank 3]
Analogous to DFG -> MDG
120. Scheduling the PMDG
Need an ordering of the PMDG that minimises page misses
[Figure: PMDG accesses to A, D, G awaiting an order]
- Topological sort with minimum page misses
- Greedy heuristic
121. List Scheduling Heuristic
At each step, select the access that leads to the longest sequence of page mode accesses
[Figure: choosing among schedulable accesses to A, D, and G]
- Propagate the schedulable condition
- Select the largest set of page mode accesses
122. Experiments: Exploration Results
[Figure: exploration results (delay vs. number of banks) for IDCT, SOR, EQN_OF_STATE, and 2D-HYDRO]
123. Summary
- Memory in High Level Synthesis
  - Registers, register files, and SRAM are modeled adequately in synthesis tools today
  - More complex memory (DRAM) needs:
    - New modeling methodologies
    - New optimizations
- Memory in Embedded Processors
  - Optimizations tailored to the data cache
    - Data layout
  - Memory architecture customized to a given application
    - Scratch Pad Memory
    - Memory Banking
124. The End
Thank You For Attending!
125. References
Books:
- P. Panda, N. Dutt, A. Nicolau, "Memory Issues in Embedded Systems-on-Chip: Optimization and Exploration", Kluwer Academic Publishers, 1999
- F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle, "Custom Memory Management Methodology", Kluwer Academic Publishers, 1998
Survey Paper:
- P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, "Data and Memory Optimization Techniques for Embedded Systems", ACM Transactions on Design Automation of Embedded Systems, April 2001