Title: Memory Optimizations in Embedded Systems
1. Memory Optimizations in Embedded Systems
- Preeti Ranjan Panda
- Synopsys, Inc.
- Email: panda_at_synopsys.com
- Embedded Systems Design Workshop
- 2-4 January, 2002, I.I.T. Delhi
2. Why is memory important?
Memory access is just another instruction...
...so why treat memory differently?
3. Rate of Performance Improvement is Different
[Figure: CPU speed vs. memory speed plotted over the years - the gap between them keeps widening]
4. Impact on Processor Pipeline
Clock cycle determined by slowest pipeline stage
5. Memory Hierarchy
- To retain a small clock cycle, we keep a small memory in the pipeline - this leads to a memory hierarchy backed by Main Memory
6. Impact of Memory Architecture Decisions
- Area
  - 50-70% of an ASIC/ASIP may be memory
- Performance
  - 10-90% of system performance may be memory related
- Power
  - 25-40% of system power may be memory related
7. Important Memory Decisions in Embedded Systems
- What is a good memory architecture for an application?
  - Total memory requirement
  - Delay due to memory
  - Power dissipation due to memory access
- Compiler and synthesis tools (exploration tools) should make informed decisions on:
  - Registers and register files
  - Cache parameters
  - Number and size of memory banks
8. Embedded Systems: Path to Implementation
Specification / Program -> HW/Software Partitioning
HW side - Synthesis Flow:
- High Level Synthesis
- RTL Synthesis
- Logic Synthesis
SW side - Compiler Flow:
- Parsing
- Optimizations
- Code Generation
9. Outline
- Memory Modeling in High Level Synthesis
  - Registers and Register Files
  - Modeling SRAM Access
  - Modeling DRAM Access
- Optimizations
  - Data Placement
  - Register Allocation
  - Storing Data in SRAM
  - Data Cache
- Memory Customisation and Exploration
  - Scratch Pad Memory
  - Memory Banking
11. High Level Synthesis
- Under constraints:
  - Total delay
  - Limited resources
12. High Level Synthesis: Scheduling
13. High Level Synthesis: Resource Allocation and Binding
14. Registers in High Level Synthesis
[Figure: dataflow graph with values A, B, C, D held in registers]
Resource constraint: 2 adders
15. Register Access Model
16. Limitation of Registers
- Complex interconnect
- Every register connects to every FU
[Figure: registers R1-R4 fully connected to functional units FU1-FU4]
17. Register Files
[Figure: registers R1-R3 grouped into a register file]
- Modular architecture
- Limited connectivity
- New optimization opportunities
18. Access Model of Register Files
[Figure: register file access model]
20. Motivation for SRAM
- Limitation of the register file:
  - OK for scalar variables
  - NOT OK for array variables
- Need to handle a large address space
- But retain fast access to scalar variables
21. SRAM Access
[Figure: SRAM connected to the data bus]
22. SRAM-based Architecture
[Figure: datapath with address and data connections to SRAM]
Similar to a processor, but predictability is necessary
23. Memory Model in HLS
Multicycle operations
24. Behavioral Templates
1. Define precedence constraints between stages
2. Templates are used directly by the scheduler
[Figure: template with Stage 1, Stage 2, Stage 3]
25. Templates for Memory Access
3-cycle MEMORY READ template
26. Using Memory Templates
- The operation can be scheduled into Cycle 1
- No change to the scheduling algorithm
- Used in Synopsys Behavioral Compiler
28. Motivation for DRAM
- Large arrays stored in off-chip DRAM
- Embedded DRAM technology
29. DRAM-based Architecture
[Figure: register file and on-chip SRAM reached in 1 cycle; off-chip DRAM takes ~10 cycles]
DRAM access times are not constant!
30. Typical DRAM Organization
[Figure: the address bus feeds a row address to the row decoder, which selects a page of the cell array, and a column address to the column decoder, which drives the data bus]
31. Memory Read Operation
[Figure: synthesis model of the read sequence through row and column decode]
32. Reading Multiple Words
Sample behavior FindAverage:
  Av = (b[0] + b[1] + b[2] + b[3]) / 4
Memory read: 7 cycles; add: 1 cycle; shift: 1 cycle
Each read issues a full row address and column address
Schedule length: 7 x 4 = 28 cycles
33. Page Mode Read
Behavior FindAverage:
  Av = (b[0] + b[1] + b[2] + b[3]) / 4
With page mode, the row is decoded once and the four words are read through the column decoder
Schedule length: 14 cycles (50% faster)
34. Modeling Memory Operations
[Figure: timing templates for Row Decode, Col Decode (Read), Col Decode (Write), Precharge, and Setup, each showing its column address and data phases]
35. Memory Write Operation
[Figure: write sequence - row address and row decode, then column address, data, and column decode, then precharge]
36. Read-Modify-Write (R-M-W)
[Figure: R-M-W at address A0 - one row/column decode, modify, write back]
Schedule length: 10 cycles
A separate read and write would cost 14 cycles
37. Page Mode Write
Example behavior:
  for i = 0 to 7: b[i] = 0
[Figure: one row decode, a sequence of page mode writes, then precharge]
38. Page Mode R-M-W
Example behavior:
  for i = 0 to 7: b[i] = b[i] + 1
[Figure: one row decode, a sequence of page mode read-modify-writes, then precharge]
40. Clustering of Scalars
Example: c = a + b - if scalars a and b sit in the same page P, the 2 reads become a single page mode read
Problem: cluster variables into groups of size P (the page size) to maximize page mode operations
Analogous to the Cache Line Clustering Problem
41. Reordering Memory Accesses
a, b in different pages; a[i] = a[i] + b[i]
Naive order - Read a[i], Read b[i], Write a[i]: 21 cycles
Reordering Read b[i] first turns the access to a[i] into an R-M-W: 16 cycles
42. Reordering Memory Accesses
[Figure: DFG reading b[i], c[i], d[i] and writing c[i], d[i], with the candidate R-M-W paths marked]
R-M-W is possible only for non-intersecting paths
43. Hoisting
Example: if (c > 0) y = a[0]; else y = a[1]
The reads of a[0] and a[1] can be hoisted above the branch and scheduled together
44. Loop Transformations 1
45. Loop Transformations 2
Multiple arrays accessed in a loop, with a, b, c in different pages:
  for (i = 0; i < N; i++) { Read a[i]; Read b[i]; Write c[i] }
Unroll, then group accesses to the same array so each group completes in page mode:
  for (i = 0; i < N; i += 2) { Read a[i]; Read a[i+1]; Read b[i]; Read b[i+1]; Write c[i]; Write c[i+1] }
46. Loop Transformations 3
Loops with disjoint sub-graphs in the body:
  split the loop into one loop per sub-graph
Each split loop then touches fewer pages; no page mode is possible without the transformation
47. CDFG Transformation
- Cluster scalars and assign addresses
- For each basic block:
  - Perform R-M-W reordering
- For each conditional:
  - Perform hoisting, if applicable
- For each inner loop L:
  - Perform loop splitting, if applicable
- For each loop L in the new CDFG:
  - Perform loop restructuring/unroll
48. DRAM Optimization Experiments
[Figure: schedule lengths, normalized]
On average, the optimized schedules are 40% faster than fine-grain scheduling
50. Life-time of Variables
Life-time: from the definition to the last use of a variable
51. Conflict Graph of Life-times
[Figure: life-time intervals for variables x, y, z, p, q, r, and the conflict graph in which an edge joins two variables whose life-times overlap]
52. Colouring the Conflict Graph
Minimum number of registers = chromatic number of the conflict graph
53. Colouring determines Register Allocation
[Figure: coloured conflict graph for x, y, z, p, q, r - variables with the same colour share a register]
54. Minimizing Register Count
- Graph colouring is NP-complete
- Heuristics (growing clusters)
- A polynomial-time solution exists for straight-line code (no branches): the left-edge algorithm
- Possible to incorporate other factors
  - Interconnect cost annotated as edge weight
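The left-edge algorithm mentioned above can be sketched in C - a minimal illustration assuming life-times are closed intervals [start, end] and two life-times conflict when they overlap (the interval layout and function names here are ours, not from the slides):

```c
#include <stdlib.h>

typedef struct { int start, end, reg; } Lifetime;

static int by_start(const void *a, const void *b) {
    return ((const Lifetime *)a)->start - ((const Lifetime *)b)->start;
}

/* Left-edge allocation: sort life-times by start ("left edge"), then
 * greedily place each one into the lowest-numbered register whose last
 * assigned life-time has already ended. Returns the register count. */
int left_edge_allocate(Lifetime *v, int n) {
    int nregs = 0;
    int last_end[64];                 /* last use time in each register */
    qsort(v, n, sizeof(Lifetime), by_start);
    for (int i = 0; i < n; i++) {
        int r;
        for (r = 0; r < nregs; r++)
            if (last_end[r] < v[i].start) break;  /* register is free */
        if (r == nregs) nregs++;                  /* need a new register */
        v[i].reg = r;
        last_end[r] = v[i].end;
    }
    return nregs;
}
```

For straight-line code this greedy sweep is optimal, which is why the slides single it out against NP-complete general graph colouring.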
55. Register Files / Multiport Memories
- The scalar approach is infeasible for 100s of registers - interconnect delays dominate
- Need to store variables in register files
- Limited bandwidth
Problem: how to do register allocation to register files efficiently?
56. Which variables go into Multiport Memory?
Problem: given a schedule and a multiport memory, which variables should be stored in the memory?
Schedule: State 1 accesses R1, R2, R3; State 2 accesses R2, R1, R1
Which registers should go into the dual-port memory?
57. ILP Formulation
Maximize x1 + x2 + x3 (maximize the registers stored in memory)
Constraints: x1 + x2 + x3 <= 2 and similar per-state bounds (a dual-port memory serves at most 2 accesses per state)
Solution: x1 = 1, x2 = 1, x3 = 0 (store R1 and R2 in memory)
59. Storing Array Data
- Usually, the storage layout is dictated by the language
- In embedded systems, we can reorder data
  - the entire application should be visible
- New challenges and optimisation opportunities
  - Data storage strategies
  - Memory architecture customisation
60. Storing Multi-dimensional Arrays: Row-major
int X[4][4]
[Figure: logical 4x4 array mapped to physical memory locations 0-15 in row-major order]
61. Storing Multi-dimensional Arrays: Column-major
int X[4][4]
[Figure: logical 4x4 array mapped to physical memory locations 0-15 in column-major order]
62. Storing Multi-dimensional Arrays: Tile-based
int X[4][4]
[Figure: logical 4x4 array mapped to physical memory locations 0-15 tile by tile]
63. Impact of Storage Strategy
- Successive memory references should be local (independent of memory architecture)
- Better data cache performance/energy
- Reduced address bus switching
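The three storage schemes differ only in how a logical index (i, j) maps to a physical offset. A small sketch for the slides' 4x4 array, assuming 2x2 tiles laid out row-major with elements row-major inside each tile (the tile shape and function names are ours):

```c
/* Physical offsets of logical element (i, j) of an R x C array under
 * the three storage schemes. */
enum { R = 4, C = 4, TH = 2, TW = 2 };   /* array and tile dimensions */

int row_major(int i, int j) { return i * C + j; }
int col_major(int i, int j) { return j * R + i; }

int tile_based(int i, int j) {
    int tile   = (i / TH) * (C / TW) + (j / TW);  /* which tile */
    int within = (i % TH) * TW + (j % TW);        /* offset inside tile */
    return tile * (TH * TW) + within;
}
```

Choosing the mapping that keeps successive references close together is exactly the storage-strategy decision the following slides explore.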
64. Array Access Pattern Determines Storage: Row-major
for (i = 0; i < N; i++) for (j = 0; j < N; j++) A[i][j] = 0;
The inner loop walks along a row, matching row-major storage
65. Array Access Pattern Determines Storage: Column-major
for (i = 0; i < N; i++) for (j = 0; j < N; j++) B[j][i] = B[j][i] + 1;
The inner loop walks down a column, matching column-major storage
Note: the effect can also be achieved by loop interchange
66. Array Access Pattern Determines Storage: Tile-based
Simplified kernel of the Successive Over-Relaxation algorithm:
for (i = 1; i < N-1; i++)
  for (j = 1; j < N-1; j++)
    u[i][j] = ... u[i-1][j] + u[i][j-1] + u[i][j+1] + u[i+1][j] ...
[Figure: the 5-point stencil around u[i][j]]
67. Determining Tile Shape
[Figure: execution trace of the stencil accesses]
68. Determining Tile Shape
New elements are accessed in each iteration
Tile = the smallest rectangle enclosing the access pattern
69. Address Switching Analysis
Definitions:
- Minimal transition: small address offset - few address bits switch
- Maximal transition: large address offset - many address bits switch
70. Address Bus Switching: Row-major
3 maximal transitions per iteration
Maximal transitions = number of distinct rows accessed in an iteration
71. Address Bus Switching: Tile-based (1)
Case 1: no maximal transition (the iteration's accesses fall within one tile)
72. Address Bus Switching: Tile-based (2)
Each iteration spans at most 2 tiles
Case 1: no maximal transition
Case 2: no more than 2 maximal transitions per iteration
73. Mapping Strategy
If the outer loop increment >= tile width     /* Case 1: no maximal transition */
   or the tile has 2 or more rows and columns /* Case 2: at most 2 maximal transitions */
  -> Tile-based
Else if the tile spans a single column        /* transitions follow the column */
  -> Column-major
Else
  -> Row-major
75. Array Layout and Data Cache
int a[1024]; int b[1024]; int c[1024];
...
for (i = 0; i < N; i++)
  ... a[i] ... b[i] ... c[i] ...
Data cache: direct-mapped, 512 words
[Figure: a[i], b[i], and c[i] are 1024 words apart in memory, so all three map to the same cache line]
Problem: every access leads to a cache miss
76. Data Alignment
int a[1024]; int b[1024]; int c[1024];
...
for (i = 0; i < N; i++)
  ... a[i] ... b[i] ... c[i] ...
Data cache: direct-mapped, 512 words
[Figure: the arrays are offset in memory so a[i], b[i], c[i] map to different cache lines]
Data alignment avoids cache conflicts
77. Motivating Example
struct x { int a; int b; } p[1000];
int q[1000];
...
avg = 0;
for (i = 0; i < 1000; i++)    /* Loop 1 */
  avg = avg + p[i].a;
avg = avg / 1000;
...
for (i = 0; i < 1000; i++)    /* Loop 2 */
  ... p[i].b ... q[i] ... p[i].b ...
78. Cache Performance: Loop 1
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 77)
[Figure: cache lines 0-3 during Loop 1 - each fetched line holds a p[i].a together with the unused p[i].b]
79. Cache Performance: Loop 2
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 77)
[Figure: cache lines 0-3 during Loop 2 - each fetched line holds only one useful p[i].b, while q[i] has spatial locality]
80. Cache Performance
(code as on Slide 77)
Loop 1: 1000 cache misses
Loop 2: 1500 cache misses (1000 misses for p[i].b + 500 misses for q[i])
Cache miss rate: 62.5%
81. Transformed Data Layout
struct y {
  int q;   /* originally q */
  int b;   /* originally x.b */
} r[1000];
int a[1000];   /* originally x.a */
...
avg = 0;
for (i = 0; i < 1000; i++)    /* Loop 1 */
  avg = avg + a[i];
avg = avg / 1000;
...
for (i = 0; i < 1000; i++)    /* Loop 2 */
  ... r[i].q ... r[i].b ...
(Original declaration: struct x { int a; int b; } p[1000]; int q[1000];)
82. Cache Performance: Loop 1
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 81)
[Figure: cache lines 0-3 during Loop 1 - lines now hold only a[] elements]
No useless data in cache
83. Cache Performance: Loop 2
Data cache: direct-mapped, 4 lines, 2 words/line
(code as on Slide 81)
[Figure: cache lines 0-3 during Loop 2 - each line holds the r[i].q and r[i].b used together]
No useless data in cache
84. Cache Performance
(code as on Slide 81)
Loop 1: 500 cache misses
Loop 2: 1000 cache misses
Cache miss rate: 37.5%
85. Data Layout Transformation
- Splitting structs into individual arrays
- Account for pointer arithmetic, dereferencing
- Clustering of arrays
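The splitting transformation of Slides 77-84 can be written out directly in C - a hand-applied sketch of the layout change, with our own helper name and a reduced N for illustration:

```c
/* Before: array of structs - fields a and b interleaved in memory. */
struct x { int a; int b; };

/* After: b stays clustered with q (used together in Loop 2), while the
 * a fields become a standalone array (used alone in Loop 1). */
struct y { int q; int b; };

enum { N = 8 };

/* Copy from the original layout to the transformed one. */
void split_layout(const struct x *p, const int *q,
                  struct y *r, int *a) {
    for (int i = 0; i < N; i++) {
        a[i]   = p[i].a;   /* a fields now contiguous */
        r[i].q = q[i];     /* q and b of the same i share a cache line */
        r[i].b = p[i].b;
    }
}
```

In a real compiler pass the copy is not performed at run time; instead all references are rewritten to the new layout, which is why pointer arithmetic and dereferencing must be accounted for.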
86. Representing Array Accesses
for i = 1 to 100    // Loop L1
  Read A[i]; Read B[i]; Read C[i]
for i = 1 to 2000   // Loop L2
  Read B[i]; Read C[i]
for i = 1 to 500    // Loop L3
  Read A[i]; Read D[i]
[Figure: bipartite graph - loop nodes L1 (100), L2 (2000), L3 (500) connected to the array nodes A, B, C, D they access]
87. Clustering Algorithm
- Start with an empty cluster set
- Sort all loops in decreasing order of array access count
- For each loop:
  - for each unassigned array in the loop:
    - determine the cost of assigning the array to each existing cluster (including a new empty cluster)
    - assign the array to the cluster with least cost
88. Cost Computation
Correlated arrays: array indices are affine and differ by a constant
- Penalty for assigning two correlated arrays to separate clusters
- Penalty for assigning two uncorrelated arrays to the same cluster
89. Experiments on DSP/Image/Scientific Examples
Average reduction in cycle time: 44%
90. Motivating Example: FFT
double sigreal[2048];
...
le = le / 2;
for (i = j; ...) {
  ... sigreal[i] ...
  ... sigreal[i + le] ...
  ...
  sigreal[i] = ...;
  sigreal[i + le] = ...;
}
91. Example: FFT - Padded
double sigreal[2048 + 16];
...
le = le / 2;  le = le + le / 128;
for (i = j; ...) {
  i = i + i / 128;
  ... sigreal[i] ...
  ... sigreal[i + le] ...
  ...
  sigreal[i] = ...;
  sigreal[i + le] = ...;
}
15% speed-up on a Sparc5 due to padding
92. Loop Blocking
Original code:
  for i = 1 to N
    for k = 1 to N
      r = X[i][k]
      for j = 1 to N
        Z[i][j] = Z[i][j] + r * Y[k][j]
Blocked code (B x B blocks):
  for kk = 1 to N by B
    for jj = 1 to N by B
      for i = 1 to N
        for k = kk to min(kk + B - 1, N)
          r = X[i][k]
          for j = jj to min(jj + B - 1, N)
            Z[i][j] = Z[i][j] + r * Y[k][j]
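The blocked code above, made compilable for fixed sizes - the concrete N and B are ours, and the loops use 0-based C indexing:

```c
enum { N = 8, B = 4 };

static int min(int a, int b) { return a < b ? a : b; }

/* Blocked matrix multiply Z += X * Y: the kk/jj loops walk B x B blocks
 * so the touched pieces of Y and Z stay cache-resident across i. */
void matmul_blocked(int X[N][N], int Y[N][N], int Z[N][N]) {
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min(kk + B, N); k++) {
                    int r = X[i][k];
                    for (int j = jj; j < min(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

The result is identical to the unblocked loop nest; only the order of the k and j iterations changes, which is what converts capacity misses on Y and Z into reuse.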
93. Terminology
- Compulsory cache miss: data never before brought into the cache
- Capacity cache miss: cache too small
- Conflict cache miss: competition for cache space
  - Self-interference: within the same tile
  - Cross-interference: across different tiles
[Figure: a tile of an array mapped into the data cache, showing self- and cross-interference]
94. Self-Interference Conflicts
[Figure: a 30 x 30 tile of an array with 256-word rows mapped into a 1024-element direct-mapped cache - the tile's rows fall 256 apart in the cache, so rows four apart collide]
95. Padding Avoids Self-Interference
[Figure: padding each 256-word row by 8 words staggers the 30 tile rows in the 1024-element direct-mapped cache, so they no longer collide]
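The pad can be found by direct search: grow the row length until the tile's rows occupy disjoint cache regions. A simplified sketch for a direct-mapped cache - the search strategy is ours (published tile/pad-selection heuristics are more sophisticated), but it reproduces the pad of 8 for the slide's 30x30 tile:

```c
enum { CACHE = 1024 };   /* direct-mapped cache size in words */

/* Return 1 if the first `rows` rows of an array with `cols` columns map
 * to non-overlapping cache regions, each row contributing `width` words
 * (the tile width). */
static int rows_conflict_free(int cols, int rows, int width) {
    for (int a = 0; a < rows; a++)
        for (int b = a + 1; b < rows; b++) {
            int da = (a * cols) % CACHE;   /* cache position of row a */
            int db = (b * cols) % CACHE;
            int d  = da < db ? db - da : da - db;
            if (d < width || CACHE - d < width)
                return 0;                  /* tile rows overlap in cache */
        }
    return 1;
}

/* Smallest pad so a rows x width tile has no self-interference. */
int find_pad(int cols, int rows, int width) {
    for (int pad = 0; pad < CACHE; pad++)
        if (rows_conflict_free(cols + pad, rows, width))
            return pad;
    return -1;   /* tile too large for this cache */
}
```

With 256-word rows every fourth tile row lands on the same cache location; padding the rows to 264 words staggers all 30 rows at least 32 words apart, eliminating the self-interference.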
96. Multiple Tiled Arrays
[Figure: tiles of the initial arrays X and Y, and the new padded tile of R (R + PAD)]
97. Stability of Cache Performance
Matrix multiplication (array sizes 35-350)
[Figure: cache performance of tile-size-selection schemes TSS, ESS, LRW, and DAT across array sizes]
DAT uses fixed tile dimensions; the others use widely varying sizes
99. Embedded System Synthesis
Specification (e.g. a[i] = b[i] + c[i])
Hw/Sw Partitioning
[Figure: processor core with on-chip instruction memory and on-chip data memory, synthesized HW, and on-chip/off-chip DRAM]
100. Scratch-Pad Memory
- Embedded processor-based system
  - Processor core
  - Embedded memory
    - Instruction and data cache
    - Embedded SRAM (Scratch Pad Memory)
    - Embedded DRAM
- Problem: partition program data into on-chip and off-chip memory
101. Memory Address Space
Addressable memory as seen by the CPU:
- Addresses 0 to P-1: on-chip scratch-pad memory - 1-cycle access
- Addresses P to N-1: off-chip memory - 1 cycle on a data cache hit, 10-20 cycles on a miss
102. Architecture
[Figure: the CPU core issues address and data to the scratch-pad memory and, in parallel, to the data cache; on a miss the external memory interface fetches from DRAM]
103. Illustrative Example - 1
Procedure Histogram_Evaluation:
char BrightnessLevel[512][512];
int Hist[256];
for (i = 0; i < 512; i++)
  for (j = 0; j < 512; j++) {      /* each pixel (i,j) in the image */
    level = BrightnessLevel[i][j];
    Hist[level] = Hist[level] + 1;
  }
Regular access (BrightnessLevel): assign to cache
Irregular access (Hist): assign to SRAM
104. Illustrative Example - 2
int source[128][128], dest[128][128], mask[4][4];
Procedure CONV:
loop i
  loop j
    dest[i][j] = Mult(source[i][j], mask)
mask: small, reused in every iteration - on-chip SRAM
source: large - off-chip, accessed through the cache
[Figure: the mask window over source at iterations (0,0) and (0,1)]
105. Data Partitioning
- Pre-partitioning
  - Scalar variables and constants to SRAM
  - Large arrays to DRAM
- SRAM/Cache partitioning
  - Identify critical data for mapping to SRAM
  - Criteria:
    - Life-times of arrays
    - Access frequency of arrays (IF)
    - Loop conflicts (LCF)
106. Data Partitioning Experiments
Average 30% improvement in memory cycles
107. Reuse Analysis
Which memory references reuse a cache line?
loop i
  loop j
    ... a[i][j], a[i][j+1], a[i][j-1], a[i-1][j], a[i+1][j], b[i], c[j][i] ...
Each reference exhibits group spatial, self spatial, self temporal, or no reuse
- Divide memory references into Reuse Equivalence Classes
- Volume analysis
108. Architecture Exploration
Algorithm MemExplore:
for each on-chip memory size T (in powers of 2)
  for each cache size C (in powers of 2, C < T), with scratch-pad size S = T - C
    for each line size L (in powers of 2, L <= C)
      keep the (C, L) that maximizes performance
109. Variation with Cache Line Size
Example: Histogram; cache size 1 KB
[Figure: memory performance vs. cache line size]
110. Variation with Cache/Scratch-Pad Ratio
Example: Histogram
[Figure: effect of different ratios of SRAM and cache; total on-chip memory size 2 KB]
111. Variation with Total On-chip Memory Space
Example: Histogram
[Figure: variation of memory performance with total on-chip memory]
113. Memory Banking: Motivation
for I = 0 to 1000: A[I] = A[I] + B[I] + C[2I]
[Figure: arrays A, B, C in a single memory bank sharing one page buffer; row address = Addr[15:8], column address = Addr[7:0] - every switch between arrays misses the open page]
114. Memory Banking: Motivation
for I = 0 to 1000: A[I] = A[I] + B[I] + C[2I]
[Figure: three banks, one each for the A[I], B[I], and C[2I] streams - each bank has its own row/column decode and page buffer, so all three streams stay in page mode]
115. Exploration Algorithm
Algorithm Partition(G):
for k = 1 to M:    /* do k-way partitioning */
  1. Generate initial partition P
  2. Generate an n-move sequence into any of the k partitions
  3. Retain the partition Pk with minimum Delay(Pk)
  4. Plot (k, Area(Pk)) and (k, Delay(Pk)) on the exploration graph
end Algorithm
116. Initial Partition
[Figure: arrays A, B, C, D, E in a row; cut lines between them define the 1- to 5-bank partitions - each cut assigns a contiguous cluster of arrays to a bank]
117. Cost Function Computation
- Area(Pk): total memory area
- Delay(Pk): estimated schedule length (list scheduling)
  - Memory access delays are unknown at this stage
    - Page hits/misses unknown
  - Determine the ordering that minimises page misses for the given partition
118. Memory Dependence Graph
[Figure: a DFG over accesses A[I], A[I+1], C[I], D[I], E[I], G[I], H[I], and the Memory Dependence Graph (MDG) derived from it]
119. Partitioned MDG (PMDG)
The MDG is the basis for bank-partitioning exploration
[Figure: MDG accesses to arrays A, C, D, E, G, H partitioned into Bank 1, Bank 2, and Bank 3]
Analogous to DFG -> MDG
120. Scheduling the PMDG
Need an ordering of the PMDG that minimises page misses
[Figure: PMDG accesses to A, D, G awaiting an order]
- Topological sort with minimum page misses
- Greedy heuristic
121. List Scheduling Heuristic
At each step, select the access that leads to the longest sequence of page mode accesses
[Figure: choosing among schedulable accesses to A, D, and G]
- Propagate the schedulable condition
- Select the largest set of page mode accesses
122. Experiments: Exploration Results
[Figure: exploration results (delay vs. number of banks) for IDCT, SOR, EQN_OF_STATE, and 2D-HYDRO]
123. Summary
- Memory in High Level Synthesis
  - Registers, register files, and SRAM are modeled adequately in synthesis tools today
  - More complex memory (DRAM) needs:
    - New modeling methodologies
    - New optimizations
- Memory in Embedded Processors
  - Optimizations tailored to the data cache
    - Data layout
  - Memory architecture customized to a given application
    - Scratch Pad Memory
    - Memory Banking
124. The End
Thank You For Attending!
125. References
Books:
- P. Panda, N. Dutt, A. Nicolau, "Memory Issues in Embedded Systems-on-Chip: Optimization and Exploration", Kluwer Academic Publishers, 1999
- F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle, "Custom Memory Management Methodology", Kluwer Academic Publishers, 1998
Survey Paper:
- P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, "Data and Memory Optimization Techniques for Embedded Systems", ACM Transactions on Design Automation of Embedded Systems, April 2001