Title: Compiler based Optimization Techniques for Scratchpad Memory
1. Compiler based Optimization Techniques for Scratchpad Memory
- Manish Verma, Peter Marwedel
- Department of Computer Science XII,
- University of Dortmund,
- Germany
2. Outline
- Introduction
- Motivation
- Static Allocation Approach
  - Scratchpad-only architecture [1]
  - Cache + Scratchpad architecture
- Dynamic Allocation Approach
  - Scratchpad-only architecture
- Conclusion and Future Work
[1] S. Steinke, DATE, 2002
3. Embedded Systems
- Embedded systems (ES): information processing systems embedded into a larger product
  Main reason for buying is not information processing
- Transportation (e.g. ABS)
- Telecommunication (e.g. mobile phone)
- Manufacturing (incl. robotics)
- Medical instruments (e.g. artificial eye)
www.dobelle.com
4. Power Issues
"Power is considered as the most important constraint in embedded systems" [in: Eggermont (ed.), Embedded Systems Roadmap 2002, STW]
5. Power Distribution
- Memory subsystem consumes > 50% of total energy budget [1]
- Memory hierarchy
  - Cache vs. Scratchpad
    - Power [2]
    - Performance [2]
    - Predictability [3]
    - Software Support
[1] S. Segars, ISSCC, 2001; [2] S. Steinke, DATE, 2002; [3] P. Marwedel, ASPDAC, 2004
6. Low Power/Energy Techniques are Essential
- Low energy dissipation is imperative for battery-driven embedded systems
- Low power techniques are essential to both embedded systems and high performance processors
- High performance processors are going to be too hot to work: "Hot enough to cook an egg." [Skadron et al., 30th ISCA]
8. Focus on memory- and energy-aware compilation: Scratchpad memories (SPM)
- Fast,
- energy-efficient,
- timing-predictable
Small; no tag memory
9. Scratchpad vs. main memory energy
Example: Atmel ARM evaluation board
Energy reduction by a factor of 7.06; 100% predictable
[Figure: energy per access for the combinations of program and data located in SPM or in main memory]
10. Static Allocation (Scratchpad only)
Example: int nat(), real sin(), char ch(), int wh(), int p, real a, int c
Which objects (functions, variables) should be stored in SPM? Each object m has a gain $g_m$ and a size $s_m$. Maximise the gain $G = \sum_m g_m$ of the objects placed in SPM, respecting the capacity constraint $K \ge \sum_m s_m$.
Static memory allocation = Knapsack problem
[Figure: objects in "main" memory as candidates for the SPM of capacity K]
11. Static Allocation (Scratchpad only)
Symbols:
- $s(var_m)$: size of variable m
- $n(var_m)$: number of accesses to variable m
- $e(var_m)$: energy saved per variable access, if $var_m$ is migrated
- $E(var_m)$: energy saved if $var_m$ is migrated ($= e(var_m) \cdot n(var_m)$)
- $x(var_m)$: 1 if variable m is migrated to SPM, else 0
- $M$: set of variables
Similar for functions $F_i$, $i \in I$.
Integer programming formulation:
Maximize $\sum_{i \in I} x(F_i)\,E(F_i) + \sum_{m \in M} x(var_m)\,E(var_m)$
Subject to the constraint $\sum_{i \in I} s(F_i)\,x(F_i) + \sum_{m \in M} s(var_m)\,x(var_m) \le K$
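The formulation above is a 0/1 knapsack over functions and variables. As a minimal sketch (not the authors' implementation), the following dynamic program picks the objects to migrate. The object names are taken from the example slide, but every size and energy value, as well as the capacity of 64, is a hypothetical number chosen for illustration.

```python
from typing import Dict, List, Tuple

def allocate_static(objects: Dict[str, Tuple[int, int]],
                    capacity: int) -> Tuple[int, List[str]]:
    """0/1 knapsack. objects maps name -> (size s, energy saving E)."""
    # best[k] = (total saving, chosen names) using at most k units of SPM
    best: List[Tuple[int, List[str]]] = [(0, [])] * (capacity + 1)
    for name, (size, gain) in objects.items():
        # walk capacities downwards so each object is chosen at most once
        for k in range(capacity, size - 1, -1):
            candidate = best[k - size][0] + gain
            if candidate > best[k][0]:
                best[k] = (candidate, best[k - size][1] + [name])
    return best[capacity]

# Hypothetical sizes (bytes) and energy savings for functions and variables.
OBJECTS = {
    "nat": (40, 120), "sin": (60, 200), "ch": (20, 30),
    "p": (4, 50), "a": (8, 90), "c": (4, 10),
}

saving, chosen = allocate_static(OBJECTS, capacity=64)
print(saving, sorted(chosen))
```

With these made-up numbers the most frequently "profitable" single object (sin) is not selected, because several smaller objects together save more energy within the same capacity.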
12. Results (Energy and Runtime)
Feasible with standard compiler postpass optimization
[Figure: energy and cycles for Multi_sort (mix of sort algorithms)]
14. Static Allocation (Cache + Scratchpad)
- Caches + Scratchpads
  - I-Mem subsystem
- Trace Generation
  - memory objects (MO)
- Conflict Graph
  - models I-Cache behavior
  - interaction of MOs
- Fine Grained Energy Model
  - cache hits
  - cache misses
15. Example
- Execution sequence: B1 ((B2 B5 B6 B7)^9 (B2 B3 B4 B7))^10 B8
[Figure: basic blocks B1 to B8 mapped to the I-Cache, annotated with the pairs 100, 0; 10, 10; 100, 0; 90, 10]
Total cache misses: 40
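The miss count in this example can be reproduced with a small direct-mapped cache simulation. This is only a sketch: the mapping of blocks to cache lines (`LINE_OF`) is an assumption, chosen so that B3/B5 and B4/B6 evict each other, and each basic block is assumed to occupy exactly one cache line.

```python
from collections import Counter

def block_sequence():
    """Expand B1 ((B2 B5 B6 B7)^9 (B2 B3 B4 B7))^10 B8 into a list of blocks."""
    inner = ["B2", "B5", "B6", "B7"] * 9 + ["B2", "B3", "B4", "B7"]
    return ["B1"] + inner * 10 + ["B8"]

# Hypothetical direct-mapped layout: block name -> cache line index.
# B3/B5 and B4/B6 are assumed to share a line, so they evict each other.
LINE_OF = {"B1": 0, "B2": 1, "B7": 2, "B8": 3,
           "B3": 4, "B5": 4, "B4": 5, "B6": 5}

def simulate(sequence, line_of):
    resident = {}            # cache line -> block currently stored there
    executions = Counter()
    misses = Counter()
    for block in sequence:
        executions[block] += 1
        line = line_of[block]
        if resident.get(line) != block:
            misses[block] += 1       # cold miss or conflict miss
            resident[line] = block
    return executions, misses

executions, misses = simulate(block_sequence(), LINE_OF)
conflicting = ["B3", "B4", "B5", "B6"]
print({b: (executions[b], misses[b]) for b in conflicting})
print(sum(misses[b] for b in conflicting))
```

Under these assumptions the four conflicting blocks account for 40 misses; the simulation additionally counts one cold miss each for B1, B2, B7 and B8.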
16. Trace Generation
- Minimize jumps across traces
  - NP-complete problem
- Greedy approach
  - Coalesce most frequently executed basic blocks
  - Size of trace < Scratchpad size
- Append NOPs
  - Reduce I-cache misses
  - Improve processor cycles
17. Conflict Graph
T4 ((T1 T2 T1)^9 (T1 T3 T1))^10 T5
[Figure: conflict graph with node annotations T2: 180, 20; T1: 200, 0; T3: 20, 20]
- Weighted Directed Graph
  - Nodes (traces): execution frequency
  - Edges (conflict relationship): conflict misses
18. Energy Model
[Equation: energy split into a constant part and a variable part that depends on the program layout]
19. Energy Model (Example)
- E_Cache_hit = 1
- E_Cache_miss = 10
- E_Cache(T2) = 180 * 1 + 20 * (10 - 1) = 360
- E(Total) = 760
- Energy consumption in the cache
  - depends on the program layout
  - execution frequencies alone are insufficient
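The arithmetic of this example can be checked with a few lines of code. The sketch below assumes, as the T2 line suggests, that every execution pays the hit energy and every miss pays the additional difference between miss and hit energy; the per-trace (executions, misses) pairs are the ones annotated in the conflict graph.

```python
E_CACHE_HIT = 1
E_CACHE_MISS = 10

# trace -> (executions, cache misses), taken from the conflict-graph example
TRACES = {"T1": (200, 0), "T2": (180, 20), "T3": (20, 20)}

def cache_energy(executions, misses, e_hit=E_CACHE_HIT, e_miss=E_CACHE_MISS):
    # constant part: every execution costs a hit
    # variable part: each miss costs the extra (miss - hit) energy
    return executions * e_hit + misses * (e_miss - e_hit)

per_trace = {name: cache_energy(*pair) for name, pair in TRACES.items()}
total = sum(per_trace.values())
print(per_trace, total)
```

T1 and T3 are executed 200 and 20 times respectively, yet both contribute 200 units, which is why execution frequencies alone do not determine the energy.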
20. Problem Formulation
- NP-complete
  - Knapsack (no edges)
  - Maximum Independent Set (E_SP_Hit, E_Cache_Hit)
- Integer Linear Programming / Greedy Heuristic
[Figure: conflict graph with nodes T1 (200), T2 (180), T3 (20), T4 (1), T5 (1); edge weights 20 and 20; energies 360, 200, 200]
- Formal Problem Formulation
  - Given: conflict graph (G), scratchpad, I-cache, energy model
  - Determine: minimum-energy mapping
  - Assumption: no new edges due to copying traces
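For a graph as small as the one in this example, the minimum-energy mapping can be found by exhaustive search. The sketch below is not the ILP or the greedy heuristic named on the slide; it is a brute-force stand-in that uses the node weights and the two 20-miss conflict edges from the graph. The scratchpad hit energy `E_SP_HIT`, the trace sizes in `SIZE` and the capacity of 2 are hypothetical values.

```python
from itertools import combinations

E_SP_HIT, E_CACHE_HIT, E_CACHE_MISS = 0.5, 1, 10    # E_SP_HIT is assumed

EXECUTIONS = {"T1": 200, "T2": 180, "T3": 20, "T4": 1, "T5": 1}
SIZE = {"T1": 2, "T2": 2, "T3": 2, "T4": 1, "T5": 1}  # hypothetical sizes
# (victim, evictor) -> conflict misses suffered by the victim trace
CONFLICT_MISSES = {("T2", "T3"): 20, ("T3", "T2"): 20}

def energy(in_spm):
    """Total instruction-memory energy for a given set of traces in the SPM."""
    total = 0.0
    for trace, n in EXECUTIONS.items():
        total += n * (E_SP_HIT if trace in in_spm else E_CACHE_HIT)
    for (victim, evictor), m in CONFLICT_MISSES.items():
        # a conflict edge only costs energy if both traces stay in the cache
        if victim not in in_spm and evictor not in in_spm:
            total += m * (E_CACHE_MISS - E_CACHE_HIT)
    return total

def best_mapping(capacity):
    """Try every subset of traces that fits into the scratchpad."""
    best = (energy(frozenset()), frozenset())
    traces = list(EXECUTIONS)
    for r in range(1, len(traces) + 1):
        for subset in combinations(traces, r):
            if sum(SIZE[t] for t in subset) <= capacity:
                candidate = (energy(frozenset(subset)), frozenset(subset))
                if candidate[0] < best[0]:
                    best = candidate
    return best

print(best_mapping(2))
```

Under these assumptions the search moves T2 to the scratchpad rather than the more frequently executed T1, because doing so also removes the conflict misses between T2 and T3.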
21. Solution (Example)
T4 ((T1 T2 T1)^9 (T1 T3 T1))^10 T5
[Figure: conflict graph with nodes T1 (200), T2 (180), T3 (20), T4 (1), T5 (1) and the resulting placement of traces T1 to T5, padded with NOPs, across I-Cache, Scratchpad and I-Mem]
22. Energy Consumption (I-Cache)
MPEG benchmark
23. Energy Consumption (Cache + Scratchpad)
8kB DM I-Cache
MPEG benchmark
24. Energy Consumption (Cache + Scratchpad)
Static Allocation (Scratchpad only)
MPEG 20kB Cache 2K DM
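26. Motivation (Dynamic Allocation)
  SPILL_LOAD(A);                       /* copy A from main memory into the scratchpad */
  for (i = 0; i < 100; i++) A[i];      /* accesses to A are served by the scratchpad */
  for (j = 0; j < 100; j++) A[j];
  SPILL_STORE(A);                      /* copy A back to main memory */
  SPILL_LOAD(B);                       /* B now occupies the scratchpad */
  for (i = 0; i < 100; i++) B[i];
  for (j = 0; j < 100; j++) B[j];
  SPILL_STORE(B);                      /* copy B back to main memory */
[Figure: arrays A and B moved between Main Memory and Scratchpad Memory]
- Dynamic Allocation (Scratchpad Overlay)
  - increased scratchpad utilization
  - overhead due to spill routines
  - similar to register allocation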
27. Comparison against Register Allocation
[Figure: RISC processor with Data Path, Register File and Scratchpad]
- Scarce resource (Register File / Scratchpad)
- Life-time of variables (temp. regs. / vars + code)
- Similar to RA for CISC, not for RISC processors
- Memory objects (vars + code) are of various sizes
28. Workflow (Scratchpad Overlay)
- Memory Object Determination
- Scratchpad Overlay
- Memory Assignment
- Onchip Address Assignment
29. Memory Object Determination
- Memory Objects
  - Global Variables (A)
  - Non-Scalar Local Variables
  - Traces (T1, T2, T3, T4)
MO = {A, T1, T2, T3, T4}
[Figure: control flow graph of basic blocks B1 to B8]
30. Liveness Analysis
- DEF-MOD-USE
  - Vars: Profiling Info.
  - Traces: Static Analysis
LiveRange: fixed-point iterative method
[Figure: control flow graph of basic blocks B1 to B8 annotated with DEF A, MOD A, USE A, USE T3 and USE T4]
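The fixed-point iterative method mentioned here can be sketched as a standard backward data-flow analysis. The control flow graph below is assumed from the execution sequence of the earlier example, and the placement of the DEF/MOD/USE annotations on blocks is hypothetical, since the figure did not survive. MOD is treated like a USE that does not end the live range, which is also an assumption.

```python
# Assumed CFG: block -> successor blocks.
SUCC = {
    "B1": ["B2"], "B2": ["B3", "B5"], "B3": ["B4"], "B4": ["B7"],
    "B5": ["B6"], "B6": ["B7"], "B7": ["B2", "B8"], "B8": [],
}
# Hypothetical annotations: DEF A in B1; MOD A in B2 (counted as a use);
# USE T3 in B3 and B4; USE A in B6 and B8.
DEF = {"B1": {"A"}}
USE = {"B2": {"A"}, "B3": {"T3"}, "B4": {"T3"}, "B6": {"A"}, "B8": {"A"}}

def liveness(succ, defs, uses):
    """Iterate live_in/live_out to a fixed point (backward may-analysis)."""
    live_in = {b: set() for b in succ}
    live_out = {b: set() for b in succ}
    changed = True
    while changed:
        changed = False
        for block in succ:
            out = set()
            for s in succ[block]:
                out |= live_in[s]
            new_in = uses.get(block, set()) | (out - defs.get(block, set()))
            if out != live_out[block] or new_in != live_in[block]:
                live_out[block], live_in[block] = out, new_in
                changed = True
    return live_in, live_out

live_in, live_out = liveness(SUCC, DEF, USE)
for block in SUCC:
    print(block, sorted(live_in[block]), sorted(live_out[block]))
```

Because of the back edge from B7 to B2, both A and T3 stay live around the whole loop in this sketch, while only A is live on entry to B8.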
31. Memory Assignment
- Given: MOs, LiveRanges, Scratchpad
- Determine: Memory Assignment of MOs
- Assumption: Onchip addresses can be assigned to MOs
- Discussion: NP-complete, reduces to register allocation
- Solutions
  - Optimal: ILP formulation (16 sec.)
  - Near Optimal: Heuristic
[Figure: Processor, Scratchpad and Main Memory]
32. Memory Assignment (Solution)
- MO = {A, T1, T2, T3, T4}
- SP Size A T1 T4
Solution: A -> SP, T3 -> SP
[Figure: control flow graph of B1 to B8 (DEF A, MOD A, USE A, USE T3) with inserted spill blocks B9: SPILL_STORE(A), SPILL_LOAD(T3) and B10: SPILL_LOAD(A)]
33. Onchip Address Assignment
Fragmentation Problem
- Given: Memory Assignment, Scratchpad
- Determine: Onchip Address (Offset) of MOs
- Discussion: NP-complete, reduces to Ship-Building problem
- Solution
  - Optimal: MIP formulation (4 hours)
  - Near Optimal: First-fit, Best-fit heuristic
[Figure: Scratchpad address range with offsets 0, 20, 40, 60]
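A first-fit heuristic for the offset assignment can be sketched as follows. This is an illustrative version under stated assumptions, not the heuristic evaluated in the talk: live ranges are simplified to half-open intervals of abstract program points, objects are placed in order of decreasing size, and all sizes and intervals in `OBJECTS` are hypothetical.

```python
def first_fit(objects, capacity):
    """objects: name -> (size, live_start, live_end), live range half-open.

    Returns name -> offset such that objects whose live ranges overlap
    never share scratchpad addresses.
    """
    placed = {}
    by_size = sorted(objects.items(), key=lambda kv: -kv[1][0])
    for name, (size, start, end) in by_size:
        # address intervals occupied by objects that are live at the same time
        busy = sorted(
            (placed[o], placed[o] + objects[o][0])
            for o in placed
            if objects[o][1] < end and start < objects[o][2]
        )
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break                    # the gap before this interval fits
            offset = max(offset, hi)
        if offset + size > capacity:
            raise ValueError(f"{name} does not fit into the scratchpad")
        placed[name] = offset
    return placed

# Hypothetical memory objects: (size, live_start, live_end).
OBJECTS = {
    "A": (20, 0, 10), "T1": (40, 0, 4), "T3": (40, 4, 8), "T4": (20, 8, 10),
}
layout = first_fit(OBJECTS, capacity=60)
print(layout)
```

In this sketch T1, T3 and T4 are never live at the same time and therefore share offset 0, while A, which is live throughout, is placed at offset 40.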
34. Onchip Address Assignment (MIP)
O_ij: offset of memory object mo_j at edge e_i
- Non-Overlap Constraints
- Invariance Constraints
[Figure: memory objects at offsets O_ij and O_ik within the scratchpad]
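35. Results (Edge Detection)
1/8th Scratchpad
36. Results (SO vs. SA)
[Figure: Scratchpad Overlay (SO) vs. Static Allocation (SA) for Edge Detection; data labels 21, 22, 43, 64]
37. Results (SO vs. SA)
[Figure: Scratchpad Overlay (SO) vs. Static Allocation (SA); data labels 36, 34]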
38. Conclusion and Future Work
- Scratchpads are energy-efficient memories.
- Software allocation methods
  - Static Allocation Approach
    - avg. 30% reduction in energy consumption
    - SP + I-Cache is better than the best I-Cache
  - Dynamic Allocation Approach
    - avg. 30% reduction in energy consumption
- Future Work
  - Multi-memory / Multi-process
  - Near-optimal solutions
39. Multi-process Scratchpad Allocation Strategies
- Static Allocation (SAMP)
  - Divides SPM into non-overlapping regions
  - Good for large scratchpads
- Dynamic Allocation (DAMP)
  - Single common region for all processes
  - Good for small scratchpads
- Hybrid Allocation (HAMP)
  - Static + Dynamic regions
  - Good for all scratchpads
[Figure: Scratchpad divided into a Static Region and a Dynamic Region]
40. Results
Benchmarks: adpcm, g721, mpeg, edge_detection
41. Objective Function (ILP)
Objective function: energy savings
Maximize:
- the energy reduction by assigning mo_k to the Scratchpad at edge e_i,
- minus the energy overhead of loading mo_k to the Scratchpad at edge e_i,
- minus the energy overhead of storing mo_k from the Scratchpad to main memory at edge e_i
42. Constraints (ILP)
- DEF Constraint
- USE/MOD/CONT Constraint
[Figure: edges e_i and e_j annotated with DEF mo_k, STORE mo_k and CONT mo_k]
43. Memory Assignment (ILP)
- ILP inequations for edges, not for basic block nodes
- Attributes on edges: AttribSTATIC and AttribSPILL
  - AttribSTATIC = {DEF, MOD, USE, CONT}
  - AttribSPILL = {LOAD, STORE}
  - DEF > MOD > USE > CONT
44. Edge Attributes (example)
- AttribSPILL = {STORE, LOAD}
- STORE attribute
  - edges with DEF attribute
  - in-edges of a merge node
- LOAD attribute
  - edges with USE, MOD, CONT attribute
  - out-edges of a diverge node
[Figure: control flow graph of B1 to B8 annotated with DEF A, MOD A and USE A; edges labelled DEF, MOD, USE, CONT together with the spill attributes STORE and LOAD]
45. Preloaded Loop Cache
- Nice balance between Cache and Scratchpad
- Little or no software support
- Predictable I-Cache behavior
- Energy overhead of Controller
- Predetermined number of memory objects (2-8)
- Strong dependence on application
[Figure: Processor, Controller, Loop Cache and I-Cache]
46. Motivation
- Static Scratchpad Allocation
[Figure: I-Cache and Scratchpad contents with block B7 highlighted; annotations 100, 0; 10, 10; 100, 0; 90, 10]
Total cache misses: 200
47. Cache Aware Scratchpad Allocation (CASA) Algorithm
- Trace Generation
  - memory objects (MO)
- Conflict Graph
  - models cache behavior
  - interaction of MOs
- Energy Model
- Integer Linear Inequations
48. Trace Generation
B1 ((B2 B5 B6 B7)^9 (B2 B3 B4 B7))^10 B8
T4 ((T1 T2 T1)^9 (T1 T3 T1))^10 T5
[Figure: trace annotations T2: 180, 20; T1: 200, 0; T3: 20, 20]
49. SP (CASA) vs. LC (Ross) Energy
MPEG 20kB Cache 2K DM 4 Memory Obj.
Loop Cache (Ross)
50. SP (CASA), SP (Steinke) vs. LC (Ross) Energy
Loop Cache (Ross)
51. Characteristics of Embedded Systems
- Dependability
- Reliability
- Safety
- Security/privacy
- Meeting real-time constraints
- Reactive (? finite state machine)
- Specialized user interface
- Efficiency (weight, energy, price)
- Analogue and digital components
- Sensors, connected to physical environment