Title: EE201A Lecture 5
1EE201A - Lecture 5
Memory management
2Motivation
- Modeling of multi-dimensional arrays
- Memory management
3Motivation
- Step 1: model, or representation, of arrays
- Step 2: given a model,
- Find a feasible implementation (schedule)
- Optimize e.g. memory size, memory access, memory latency, power or energy consumption
4Problem formulation
- Multi-dimensional periodic scheduling
- Given a signal flow graph G,
- With, for each operation, a lower and upper bound on each of its periods and its start time
- With, for each operation, a cost factor
- Find a schedule s that satisfies
- Timing constraints
- PU constraints
- Precedence constraints
- MPS is NP-hard
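The PU constraint can be illustrated with a minimal sketch. It assumes the usual multidimensional periodic model, which the slide does not spell out: execution i of an operation starts at s + p.i, where s is the start time, p the period vector, and i an iterator vector bounded per dimension. The function names and the example numbers are hypothetical, and precedence checks are left out.

```python
from itertools import product

def executions(start, periods, bounds):
    """Yield start times s + p.i for every iterator vector 0 <= i_k <= bounds[k]."""
    for i in product(*(range(b + 1) for b in bounds)):
        yield start + sum(p * ik for p, ik in zip(periods, i))

def pu_conflict_free(ops):
    """ops: list of (start, periods, bounds, exec_time) sharing one processing unit.
    Returns True if no two executions ever occupy the unit in the same cycle."""
    busy = set()
    for start, periods, bounds, exec_time in ops:
        for t0 in executions(start, periods, bounds):
            for t in range(t0, t0 + exec_time):
                if t in busy:
                    return False
                busy.add(t)
    return True

# Hypothetical operations, each iterating over a 2 x 3 index space.
u = (0, (6, 2), (1, 2), 1)   # occupies cycles 0, 2, 4, 6, 8, 10
v = (1, (6, 2), (1, 2), 1)   # occupies cycles 1, 3, 5, 7, 9, 11
w = (2, (6, 2), (1, 2), 1)   # collides with u at cycle 2
```

Here `pu_conflict_free([u, v])` holds, while `pu_conflict_free([u, w])` does not. Exhaustive enumeration is only practical for small iterator spaces.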
5Approach
- Decompose into two stages
- First stage: Period Assignment
- Given SFG G
- Find period assignment p
- Second stage: Fixed-period Multidimensional Periodic Scheduling
- Given SFG G, fixed period vector, lower and upper bounds on the start time
- Find start time assignment s and PU assignment (W,h)
6More Examples II DCT
7Reading
- F. Catthoor, K. Danckaert, S. Wuytack, N. Dutt, "Code transformations for Data Transfer and Storage Exploration Preprocessing in Multimedia Processors," IEEE Design & Test of Computers, May-June 2001, pp. 70-82.
- P. Panda, F. Catthoor, N. Dutt, et al., "Data and memory optimization techniques for embedded systems," ACM Transactions on Design Automation of Electronic Systems, Vol. 6, no. 2, April 2001, pp. 149-206 (sections 1 and 3 distributed today)
- Class presentation
- P. Murthy, E.A. Lee, "Multidimensional Synchronous Dataflow," IEEE Transactions on Signal Processing, Vol. 50, no. 7, July 2002 (distributed today)
8Why is memory important?
Memory access is just another instruction...
...so why treat memory differently?
according to the Hennessy/Patterson book
9Memory Performance Bottleneck
10Impact on Processor Pipeline
Clock cycle determined by slowest pipeline stage
11Memory Hierarchy
- To retain a smaller clock cycle, we keep a small memory in the pipeline
- Leads to Memory Hierarchy
Main Memory
12Impact of Memory Architecture Decisions
- Area
- 50-70% of ASIC/ASIP may be memory
- Performance
- 10-90% of system performance may be memory related
- Power
- 25-40% of system power may be memory related
13Power Distribution in CMOS LSIs
Source: Sakurai & Kuroda, EDTC '97, Tut. D, Low power Circuit Design Multimedia LSIs
14Memory Power Bottleneck
[Figure: system block diagram: PROC, DP, MMU, two SRAMs, embedded DRAM, and external memory]
P(Ext. Access) = typ. 30 x P(Arithmetic Operations)
P(Int. Memory) = typ. 40-60% of P(Chip)
15Important Memory Decisions in Embedded Systems
- What is a good memory architecture for an application?
- Total memory requirement
- Delay due to memory
- Power dissipation due to memory access
- Compiler and Synthesis tool (Exploration tools) should make informed decisions on
- Registers and Register files
- Cache parameters
- Number and size of memory banks
16Outline
- Model of Registers
- single registers
- register files
- number of registers
- number of register files
- next: on-chip memory
- off-chip memory
17Design sequence
[Figure: design sequence. The Spec splits into multidimensional arrays, handled by background memory management, and scalars, handled by foreground memory management as part of HW or SW]
18Embedded Systems Path to Implementation
Specification/ Program
HW/Software Partitioning
HW
SW
- Synthesis Flow
- High Level Synthesis
- RTL Synthesis
- Logic Synthesis
- Compiler Flow
- Parsing
- Optimizations
- Code Generation
19High Level Synthesis
- Under Constraints
- Total Delay
- Limited Resources
20High Level Synthesis Scheduling
Spec: Y computed from A and B; Z from C and D; X from Y and Z
[Figure: DFG of the spec and the corresponding scheduled DFG]
Assign clock cycles to operations
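As a rough illustration of assigning clock cycles, the following sketch applies as-soon-as-possible (ASAP) scheduling to the dependencies of the spec above (X needs Y and Z). ASAP is one simple scheduling strategy chosen here for illustration, not necessarily the one used in the lecture; the unit latency of every operation and the function name are assumptions.

```python
def asap_schedule(deps):
    """deps maps each operation to the operations whose results it consumes.
    Returns {op: cycle}; cycles start at 1 and every operation takes one cycle."""
    cycle = {}

    def visit(op):
        # An operation runs one cycle after its latest predecessor.
        if op not in cycle:
            cycle[op] = 1 + max((visit(p) for p in deps[op]), default=0)
        return cycle[op]

    for op in deps:
        visit(op)
    return cycle

# Operations are named after the value they produce; A, B, C, D are primary inputs.
deps = {"Y": [], "Z": [], "X": ["Y", "Z"]}
schedule = asap_schedule(deps)
```

With these dependencies, Y and Z land in cycle 1 and X in cycle 2. ASAP ignores resource limits, so it gives a lower bound on the schedule length rather than a resource-feasible schedule.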
21High Level Synthesis Resource Allocation and Binding
[Figure: scheduled DFG mapped to an RTL implementation using units from a resource library]
Assign resources to operations
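Binding can be sketched as follows: given a schedule, each operation receives a unit from the resource library such that operations in the same cycle never share a unit. The operation types, the unit names ADD1 and ADD2, and the first-fit policy are assumptions made for the example.

```python
from collections import defaultdict

def bind(schedule, op_type, library):
    """schedule: {op: cycle}; op_type: {op: type}; library: {type: [unit names]}.
    Returns {op: unit}; operations in the same cycle never share a unit."""
    used = defaultdict(set)            # cycle -> units already taken in that cycle
    binding = {}
    for op in sorted(schedule, key=lambda o: (schedule[o], o)):
        cycle = schedule[op]
        free = [r for r in library[op_type[op]] if r not in used[cycle]]
        if not free:
            raise ValueError(f"not enough {op_type[op]} units in cycle {cycle}")
        binding[op] = free[0]          # first-fit choice
        used[cycle].add(free[0])
    return binding

schedule = {"Y": 1, "Z": 1, "X": 2}
op_type = {"Y": "add", "Z": "add", "X": "add"}   # assumed: all three are additions
library = {"add": ["ADD1", "ADD2"]}              # hypothetical library with 2 adders
binding = bind(schedule, op_type, library)
```

Here Y and Z occupy the two adders in cycle 1, and X reuses ADD1 in cycle 2. A first-fit choice ignores interconnect cost; a real binder would weigh which unit minimizes multiplexers and wiring.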
22Registers in High Level Synthesis
Resource Constraint - 2 Adders
[Figure: scheduled DFG with inputs A, B, C, D and the registers it requires]
23Register Access Model
Register Read -> Operation -> Register Write
24Limitation of Registers
- Complex Interconnect
- Every register connects to every FU
[Figure: registers R1-R4, each connected to every functional unit FU1-FU4]
Compare to VLIW crossbar network
25Register Files
[Figure: register file organization with R1, R2, R3]
- Modular architecture
- Limited connectivity
- New optimization opportunities
26Access Model of Register Files
Register File
27Life-time of Variables
Register optimization technique
Lifetime: from definition to last use of a variable
[Figure: code fragment using variables x, y, z, a, p, q, r, k, with the lifetimes of x, y, z, p, q, r drawn as intervals]
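The lifetime definition can be made concrete with a small sketch. The program below is a hypothetical straight-line fragment, not the one from the slide; it assumes one statement per control step and that each variable is defined exactly once.

```python
def lifetimes(program):
    """program: list of (defined_var, [used_vars]), one statement per control step.
    Returns {var: (definition_step, last_use_step)} for variables defined in the program."""
    life = {}
    for step, (target, sources) in enumerate(program, start=1):
        for v in sources:
            if v in life:                      # primary inputs are not tracked
                life[v] = (life[v][0], step)   # extend lifetime to this use
        life[target] = (step, step)            # a new value is born here
    return life

# Hypothetical fragment: a and b are primary inputs.
program = [
    ("x", ["a"]),
    ("y", ["b"]),
    ("z", ["x", "y"]),
    ("p", ["x"]),
    ("q", ["z", "p"]),
    ("r", ["q"]),
]
life = lifetimes(program)
```

For this fragment, x lives from step 1 to step 4 and y from step 2 to step 3, so the two overlap and cannot share a register.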
28Conflict Graph of Life-times
[Figure: conflict graph with nodes x, y, z, p, q, r; an edge joins two variables whose lifetimes overlap]
29Coloring the Conflict Graph
Minimum number of registers = chromatic number of the conflict graph
[Figure: colored conflict graph with nodes x, y, z, p, q, r]
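A minimal sketch of this relation, using hypothetical lifetimes rather than the slide's: build the conflict graph from lifetime intervals, then find the chromatic number by exhaustive search. The overlap convention (intervals that only touch do not conflict) is an assumption, and exhaustive search is usable only for very small graphs.

```python
from itertools import combinations, product

def conflict_graph(life):
    """Edge between two variables whose lifetimes overlap.
    Assumed convention: a value is written at the end of its definition step and
    read at the start of its last-use step, so touching intervals do not conflict."""
    return {frozenset((a, b)) for a, b in combinations(life, 2)
            if life[a][0] < life[b][1] and life[b][0] < life[a][1]}

def chromatic_number(nodes, edges):
    """Smallest k that admits a proper k-coloring, by exhaustive search."""
    nodes = list(nodes)
    for k in range(1, len(nodes) + 1):
        for colors in product(range(k), repeat=len(nodes)):
            col = dict(zip(nodes, colors))
            if all(col[a] != col[b] for a, b in map(tuple, edges)):
                return k, col
    return 0, {}

# Hypothetical lifetimes as (definition step, last-use step).
life = {"x": (1, 4), "y": (2, 3), "z": (3, 5), "p": (4, 5), "q": (5, 6), "r": (6, 7)}
edges = conflict_graph(life)
k, coloring = chromatic_number(life, edges)
```

The conflicts here are x-y, x-z, and z-p, and two colors suffice, so two registers hold all six variables.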
30Coloring determines Register Allocation
[Figure: colored conflict graph of x, y, z, p, q, r and the resulting assignment of variables to registers]
31Minimizing Register Count
- Graph Coloring is NP-complete
- Heuristics (Growing clusters)
- Polynomial time solution exists for straight line code (no branches)
- Left-edge algorithm
- Possible to incorporate other factors
- Interconnect cost annotated as edge-weight
- Overview paper: L. Stok, J. Jess, "Foreground memory management in data path synthesis," Int. J. of Circuit Theory and Applications, Vol. 20, no. 3, pp. 235-255, 1992.
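The left-edge algorithm mentioned above can be sketched as follows. The lifetimes are hypothetical, and the convention that a lifetime starting exactly where another ends may reuse its register is an assumption.

```python
def left_edge(life):
    """life: {var: (start, end)}. Returns {var: register index}.
    Sort by left edge (start), then fill one register at a time with
    lifetimes that do not overlap the ones already placed in it."""
    pending = sorted(life, key=lambda v: life[v])
    alloc, reg = {}, 0
    while pending:
        last_end, rest = None, []
        for v in pending:
            start, end = life[v]
            if last_end is None or start >= last_end:
                alloc[v] = reg        # fits behind the previous lifetime
                last_end = end
            else:
                rest.append(v)        # try again in the next register
        pending, reg = rest, reg + 1
    return alloc

# Hypothetical lifetimes as (definition step, last-use step).
life = {"x": (1, 4), "y": (2, 3), "z": (3, 5), "p": (4, 5), "q": (5, 6), "r": (6, 7)}
alloc = left_edge(life)
```

Register 0 receives x, p, q, r and register 1 receives y, z. For interval lifetimes like these, the number of registers used equals the maximum number of simultaneously live values, which is why straight-line code escapes the NP-completeness of general graph coloring.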
32Register Files/Multiport Memories
- Scalar approach infeasible for 100s of registers
- interconnect delays dominate
- Need to store variables in Register Files
- Limited Bandwidth
Problem: How to do Register Allocation to Register Files efficiently?
33Which variables go into Multiport Memory?
Problem: Given a Schedule and a Multiport memory, which variables should be stored in the memory?
Schedule (registers accessed in each state):
State1: R1, R2, R3
State2: R2, R1, R1
Which registers should go into Dual-port Memory?
34ILP Formulation
Let xi = 1 if register Ri is stored in the memory, 0 otherwise.
Maximize x1 + x2 + x3 (Maximize regs stored in Memory)
Constraints:
x1 + x2 + x3 <= 2 (State1: Max 2 parallel accesses)
x1 + x2 <= 2 (State2: Max 2 parallel accesses)
Solution: x1 = 1, x2 = 1, x3 = 0 (Store R1 and R2 in Memory)
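For a problem this small, the 0/1 program can be checked by enumerating every assignment. The sketch below does that instead of calling an ILP solver; the function name and the encoding of states as index sets are assumptions.

```python
from itertools import product

def solve(states, ports, n):
    """Enumerate all 0/1 assignments; x[i] = 1 stores register R(i+1) in the memory.
    states: list of sets of register indices accessed in each state.
    Returns (best objective value, list of all optimal assignments)."""
    best, optima = -1, []
    for x in product((0, 1), repeat=n):
        # Each state may access at most `ports` memory-resident registers.
        if all(sum(x[i] for i in s) <= ports for s in states):
            if sum(x) > best:
                best, optima = sum(x), [x]
            elif sum(x) == best:
                optima.append(x)
    return best, optima

states = [{0, 1, 2}, {0, 1}]   # State1 accesses R1, R2, R3; State2 accesses R1, R2
best, optima = solve(states, ports=2, n=3)
```

The optimum value is 2, and x1 = 1, x2 = 1, x3 = 0 is one of three assignments that reach it; storing R1 and R3, or R2 and R3, satisfies the same constraints. Enumeration grows as 2^n, so larger instances need a real ILP solver.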
35Intermediate conclusion
- Memory management is important
- Two main types
- background memory optimization (multidimensional arrays)
- foreground memory optimization (scalars)
- Foreground memory
- registers: graph coloring
- register files and limited access
- model of individual read/write operations to SRAM
36Motivation for SRAM
- Limitation of Register File
- OK for scalar variables
- NOT OK for array variables
- Need to handle large address space
- But retain fast access to scalar variables
37SRAM Access
[Figure: SRAM block attached to the data bus]
38SRAM-based Architecture
[Figure: SRAM-based architecture with Address and Data connections]
Similar to a processor, but predictability is necessary
39Memory Model in HLS
Multicycle Operations
[Figure: memory Read and Write as multicycle operations, each with Address and Data]
40Behavioral Templates
1. Defines precedence constraints between stages
2. Templates are used directly by scheduler
[Figure: behavioral template that places operations A, B, C, D in Stage 1, Stage 2, Stage 3]
41Templates for Memory Access
[Figure: 3-Cycle MEMORY READ and 3-Cycle MEMORY WRITE templates, each with Stage 1, Stage 2, Stage 3 and its Address and Data operations]
42Using Memory Templates
[Figure: memory access template with Address, shown against Cycle 1 and Cycle 2]
- Operation can be scheduled into Cycle 1
- No change to scheduling algorithm
- Used in Synopsys Behavioral Compiler
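A minimal sketch of how a scheduler might consume such templates: each template fixes the relative cycle of every stage, and the scheduler places the whole template at the earliest start where no stage collides on a shared resource. The stage contents (address in the first cycle, data in the third), the resource names, and the operation names are assumptions; this is not the algorithm of any particular tool.

```python
# Each template lists the shared resource used in each consecutive cycle
# (None means the stage uses no shared resource). Contents are assumed.
TEMPLATES = {
    "mem_read":  ["addr_bus", None, "data_bus"],   # 3-cycle read
    "mem_write": ["addr_bus", None, "data_bus"],   # 3-cycle write
    "add":       ["adder"],
}

def schedule(ops):
    """ops: list of (name, template, [predecessor names]) in topological order.
    Returns {name: start cycle}; every stage keeps its fixed offset in the template."""
    busy, start, finish = set(), {}, {}
    for name, tmpl, preds in ops:
        stages = TEMPLATES[tmpl]
        t = 1 + max((finish[p] for p in preds), default=0)
        # Slide the whole template later until none of its stages collides.
        while any(r and (t + k, r) in busy for k, r in enumerate(stages)):
            t += 1
        busy.update((t + k, r) for k, r in enumerate(stages) if r)
        start[name], finish[name] = t, t + len(stages) - 1
    return start

ops = [
    ("ld_a", "mem_read", []),
    ("ld_b", "mem_read", []),
    ("sum", "add", ["ld_a", "ld_b"]),
    ("st", "mem_write", ["sum"]),
]
starts = schedule(ops)
```

The second read starts in cycle 2, overlapping the idle middle stage of the first, and the scheduling loop itself needs no special case for memory operations.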
43Ordering and bandwidth reduction
[Figure: graph of read/write dependencies among the accesses RA, RB, RC, RC, RA, WB, RD, WC, WD, WA]
44Schedule in 6 cycles
[Figure: schedule over time with access sequence A B B C C D D A A C B; 3 single port memories (B D, A, C)]
[Figure: schedule over time with access sequence A B B C C D D A C B A; 2 single port memories (B D, A C)]
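The effect of access ordering on the memory count can be sketched by treating arrays accessed in the same cycle as conflicting and coloring the resulting graph. The grouping of the two access sequences into six cycles below is an assumed reading of the slide, the read/write dependencies are not checked, and the exhaustive coloring is meant only for a handful of arrays.

```python
from itertools import combinations, product

def min_memories(schedule):
    """schedule: list of cycles, each a tuple of arrays accessed in that cycle.
    Arrays accessed in the same cycle must sit in different single-port memories.
    Returns (memory count, {array: memory index})."""
    arrays = sorted({a for cycle in schedule for a in cycle})
    conflicts = {frozenset(p) for cycle in schedule
                 for p in combinations(sorted(set(cycle)), 2)}
    for k in range(1, len(arrays) + 1):
        for colors in product(range(k), repeat=len(arrays)):
            mem = dict(zip(arrays, colors))
            if all(mem[a] != mem[b] for a, b in map(tuple, conflicts)):
                return k, mem
    return 0, {}

# Assumed pairing of each access sequence into 6 cycles.
first = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C"), ("B",)]
second = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("C",), ("B", "A")]
count1, mem1 = min_memories(first)
count2, mem2 = min_memories(second)
```

Under this reading, the first ordering puts A, B, and C in mutual conflict and needs three memories, while the second removes the A-C conflict so that A and C can share one memory and B and D the other.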