1. Single-pass Cache Optimization
Clive Butler and Ruofan Yang
2. Introduction of Problem
- Embedded systems execute a single application or a class of applications repeatedly.
- An emerging methodology for designing embedded systems uses configurable processors, where cache size, associativity, and line size are tunable.
- An energy model and an execution-time model are developed to find the best cache configuration for a given embedded application.
- Current processor design methodologies rely on reserving a large enough chip area for caches while conforming to area, performance, and energy cost constraints.
- A customized cache allows designers to meet tighter energy consumption, performance, and cost constraints.
3. Introduction of Problem
- In existing low-power processors, cache memory is known to consume a large portion of the on-chip energy; the cache consumes up to 43% to 50% of the total system power of a processor.
- In embedded systems where a single application or a class of applications is repeatedly executed on a processor, the memory hierarchy can be customized so that an optimal configuration is achieved.
- The right choice of cache configuration for a given application can have a significant impact on overall performance and energy consumption.
4. Introduction of Problem
- Estimating hit and miss rates is fairly easy using tools such as Dinero, but doing so for many cache sizes, associativities, and line sizes is enormously time consuming.
- Using Dinero to estimate the cache miss rate for a number of cache configurations means a large program trace must be repeatedly read and evaluated, which is very time consuming.
5. Dinero
- Dinero is a trace-driven cache simulator.
- Simulations are repeatable.
- One can simulate either a unified cache (mixed: data and instructions cached together) or separate instruction and data caches.
- Cheaper than evaluating designs in hardware.
6. Dinero
- A din record is a two-tuple: label and address.
- Cache parameters are set by command-line options.
- Labels: 0 = read data, 1 = write data, 2 = instruction fetch, 3 = escape record, 4 = escape record (causes a cache flush).
- Dinero uses the priority-stack method of memory-hierarchy simulation to increase flexibility and improve simulator performance for highly associative caches.
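The two-tuple din record above can be sketched as a small parser. This is illustrative only: the helper name, the assumption that addresses are hexadecimal, and the sample trace are not part of Dinero itself.

```python
# Minimal sketch of parsing Dinero "din" trace records (illustrative only).
# A din record is a two-tuple: label and address.
# Labels: 0 = read data, 1 = write data, 2 = instruction fetch,
#         3 = escape record, 4 = escape record (cache flush).

LABEL_NAMES = {0: "read", 1: "write", 2: "ifetch", 3: "escape", 4: "flush"}

def parse_din_record(line):
    """Split one din record into an (int label, int address) tuple."""
    label, addr = line.split()
    return int(label), int(addr, 16)  # assume hexadecimal addresses

trace = ["2 0", "2 8", "0 10"]
records = [parse_din_record(line) for line in trace]
# records[0] is (2, 0): an instruction fetch of address 0
```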
7. Introduction: Tree-based Method
- Presents a methodology to rapidly and accurately explore the cache design space.
- This is done by estimating cache miss rates for many different cache configurations simultaneously and investigating the effect of different cache configurations on the energy and performance of a system.
- Simultaneous evaluation can be performed rapidly by taking advantage of the high correlation between the cache behavior of different cache configurations.
8. ASP-DAC Paper: General Simulation Process
[Figure: the cache address (bits m_max .. m_min .. 0) is split into an index and a tag; an array stores tree addresses, each pointing into a tree whose nodes carry linked lists; misses are recorded in a cache-miss table keyed by (L, N, A).]
- Step 1: use the index bits of the cache address to look up the array of tree addresses.
- Step 2: go to the tree address and traverse the tree.
- Step 3: find the node and follow its linked list.
- Step 4: look for a matching tag in the list; on a mismatch, update the cache-miss table (L, N, A, number of cache misses).
9. ASP-DAC Paper: Tree Example
[Figure: a forest of binary trees for the cache address with tag 1010; k bits of the address are used to find the path.]
- Cache size 2: nodes 101(0) and 101(1).
- Cache size 4: nodes 10(00), 10(01), 10(10), 10(11).
- Cache size 8: nodes 1(000), 1(001), 1(010), 1(011), 1(100), 1(101), 1(110), 1(111).
- Assume each forest has a fixed line size.
10. ASP-DAC Paper: Linked List (Set-Associative)
[Figure: each tree node holds a linked list ordered from the most recently used element (head) to the least recently used element (tail).]
- A match near the head of the list is a hit for low associativity (e.g., 1); a match deeper in the list is a hit only for higher associativities (e.g., 2 or 4) and a miss otherwise.
- The rest of the address is used as the tag.
- Misses are accumulated in a miss-count table with rows (L, N, A, number of cache misses), e.g., row (1, 4, 1) has its count incremented from 0 to 1 on a miss.
11. ASP-DAC Paper: Linked List LRU Update
[Figure: after a reference, the matched element is moved to the head of the list (most recently used) and the tail remains the least recently used element; the miss-count table (L, N, A, number of cache misses) is updated on a miss, for each associativity 1, 2, and 4.]
12. Detailed Trace Example: Specifications
- Cache size (N) varies from 32 (max) to 2 (min).
- Associativity (A) varies from 4 (max) to 1 (min).
- Number of cache sets (M) varies from 8 (max) to 1 (min).
- Assume a fixed line size (L).
13. Detailed Trace Example
Instruction trace (k tag bits, m index bits):
1. 000000 -> 0
2. 001000 -> 8
3. 010000 -> 16
4. 000000 -> 0
5. 001000 -> 8
6. 000000 -> 0
7. 010000 -> 16

Miss counts for associativity 1:
L  N  M  Miss Count
1  8  8  3
1  4  4  3
1  2  2  5
1  1  1  7

[Figure: per-set contents for M = 8, 4, 2, and 1 sets as the trace is processed.]
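The associativity-1 table above can be reproduced with a short single-pass sketch: one last-seen-tag table per candidate set count, all updated while the trace is read once. This is a simplified stand-in for the paper's tree structure, assuming direct-mapped caches and taking the low three address bits as the block offset (so addresses 0, 8, 16 fall into blocks 0, 1, 2).

```python
# Single-pass miss counting for several direct-mapped cache sizes at once
# (a simplified sketch of the idea, not the paper's tree structure).

def direct_mapped_misses(trace, set_counts, block_bits=3):
    """Count misses for each candidate number of sets in one pass.

    block_bits=3 matches the example: addresses 0, 8, 16 fall into
    blocks 0, 1, 2.
    """
    last_tag = {m: {} for m in set_counts}   # set index -> stored tag
    misses = {m: 0 for m in set_counts}
    for addr in trace:
        block = addr >> block_bits
        for m in set_counts:
            idx, tag = block % m, block // m
            if last_tag[m].get(idx) != tag:  # miss: install the new tag
                misses[m] += 1
                last_tag[m][idx] = tag
    return misses

trace = [0, 8, 16, 0, 8, 0, 16]
counts = direct_mapped_misses(trace, [8, 4, 2, 1])
# counts reproduces the slide's table: {8: 3, 4: 3, 2: 5, 1: 7}
```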
14. Detailed Trace Example
Instruction trace (k tag bits, m index bits):
1. 000000 -> 0
2. 001000 -> 8
3. 010000 -> 16
4. 000000 -> 0
5. 001000 -> 8
6. 000000 -> 0
7. 010000 -> 16

Miss counts for associativity 2:
L  N   M  Miss Count
1  16  8  3
1  8   4  3
1  4   2  3
1  2   1  6

[Figure: per-set LRU contents for M = 8, 4, 2, and 1 sets as the trace is processed.]
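The associativity-2 figures can likewise be checked with a per-set LRU list, mirroring the linked-list update described earlier. This is a sketch under the same assumption of a 3-bit block offset.

```python
# Single-pass miss counting for set-associative LRU caches (sketch).

def lru_misses(trace, set_counts, assoc, block_bits=3):
    """Count misses per candidate set count for a given associativity."""
    misses = {m: 0 for m in set_counts}
    sets = {m: {} for m in set_counts}       # set index -> MRU-first list
    for addr in trace:
        block = addr >> block_bits
        for m in set_counts:
            stack = sets[m].setdefault(block % m, [])
            if block in stack:
                stack.remove(block)          # hit: move to MRU position
            else:
                misses[m] += 1
                if len(stack) == assoc:      # evict the LRU element
                    stack.pop()
            stack.insert(0, block)
    return misses

trace = [0, 8, 16, 0, 8, 0, 16]
counts = lru_misses(trace, [8, 4, 2, 1], assoc=2)
# counts matches the slide's table: {8: 3, 4: 3, 2: 3, 1: 6}
```

With `assoc=1` the same function reproduces the associativity-1 table from the previous slide.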
15. ASP-DAC Results
- Benchmarks are taken from MediaBench.
- Compared to Dinero IV, this method is on average 45 times faster at exploring the design space, while maintaining 100% accuracy.
16. Introduction: Table-based Method
- Two common cache evaluation techniques, analytical modeling and execution-based evaluation, are used to evaluate the design space.
- SPCE presents a simplified yet efficient way to extract locality properties for an entire cache-configuration design space in a single pass.
- The paper includes related work, an overview of SPCE, properties of addressing-behavior analysis used to estimate the cache miss rate, and the experiments and results.
17. Related Work
- Much existing research in this area needs multiple passes to explore all configurable parameters, or employs large and complex data structures, which restricts applicability.
- Algorithms for single-pass cache simulation examine a set of caches concurrently: Mattson et al.; Hill and Smith; Sugumar and Abraham; Cascaval and Padua.
- Janapsatya et al. present a technique to evaluate all different cache parameters simultaneously, but it was not designed with a hardware implementation in mind.
- This paper's methodology uses simple array structures, which are more amenable to a lightweight hardware implementation.
18. SPCE Overview
19. Definitions
- A time-ordered sequence of referenced addresses T[t] (t a positive integer) of length |T|, such that T[t] is the t-th address referenced.
- If T[t_i] >> b = T[t_i + d] >> b, then the addresses T[t_i] and T[t_i + d] are references to the same cache block of 2^b words (they agree above the low b bits).
- Define d as the delay: the number of unique cache references occurring between any two references for which T[t_i] >> b = T[t_i + d] >> b.
20. Definitions
- Evaluate the locality in the address sequence T[t] of a running application a_i by counting the occurrences where T[t_i] >> b = T[t_i + d] >> b and registering each occurrence in cell L(b, d) of the locality table (2^b is the block size, d is the delay).
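The counting rule above can be sketched directly. One caveat: here d is computed as the classic LRU stack distance (the number of distinct blocks touched since the previous reference to the same block, counting the block itself), which is one reading of the "unique cache references between two references" definition; the function name is illustrative.

```python
# Build one column L(b, d) of the locality table for a fixed block size 2**b
# (a sketch; d is taken to be the LRU stack distance).
from collections import Counter

def locality_column(trace, b):
    table = Counter()    # delay d -> number of occurrences
    stack = []           # distinct blocks seen, most recently used first
    for addr in trace:
        block = addr >> b
        if block in stack:
            d = stack.index(block) + 1   # distinct blocks since last reference
            table[d] += 1
            stack.remove(block)
        stack.insert(0, block)
    return dict(table)

column = locality_column([0, 8, 16, 0, 8, 0, 16], b=3)
# column counts one reuse at delay 2 and three at delay 3: {3: 3, 2: 1}
```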
21. Fully-Associative
- A fully-associative cache configuration is defined by the notation c_j = (b, n), where b defines the line size in terms of words and n the total number of lines in the cache.
- The locality table L(b, d) provides an efficient way to estimate the cache miss rate of fully-associative caches.
22. Fully-Associative Example
A sequence of addresses and the locality table for the trace T:

Trace T: t0: 0, t1: 8, t2: 16, t3: 0, t4: 8, t5: 0, t6: 16
Block sequence (block size 2^b0): 0, 1, 2, 0, 1, 0, 2

The re-reference at t3 has delay d = 3, and the one at t5 has delay d = 2.

Locality table column L(b0, d):
d  count
1  0
2  1
3  3
4  0
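Given such delays, a fully-associative LRU cache with n lines hits exactly on the references with d <= n, so its miss count is the cold misses plus the occurrences with d > n. A self-contained sketch (delays computed as LRU stack distances, as in the example):

```python
# Estimate fully-associative LRU misses from delays (self-contained sketch).
from collections import Counter

def fully_assoc_misses(trace, b, n):
    """Misses of a fully-associative LRU cache with n lines of 2**b words."""
    stack, cold, delays = [], 0, Counter()
    for addr in trace:
        block = addr >> b
        if block in stack:
            delays[stack.index(block) + 1] += 1  # record the delay d
            stack.remove(block)
        else:
            cold += 1                            # first touch: cold miss
        stack.insert(0, block)
    # a reference with delay d hits iff the cache holds at least d lines
    return cold + sum(c for d, c in delays.items() if d > n)

trace = [0, 8, 16, 0, 8, 0, 16]
# fully_assoc_misses(trace, b=3, n=4) -> 3 (only the three cold misses)
```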
23. Set-Associative
- Most real-world cache devices are built as direct-mapped or set-associative structures.
- Because of conflicts, L alone cannot be used to estimate misses, so define s as the number of sets, independent of the associativity; for a direct-mapped cache, set size = 1 and s = n.
- To analyze cache conflicts, we build a conflict table K_a(b, s) (b is the block size, s the number of sets), which is composed of a layers, one for each associativity explored.
24. Set-Associative
25. Set-Associative
- The value stored in each element of the table K_a(b, s) indicates how many times the same block (of size 2^b) is repeatedly referenced and results in a hit.
- A cache configuration with associativity w is capable of overcoming no more than w - 1 mapping conflicts.
- The number of cache hits is determined by summing the cache hits from layer a = 1 up to layer a = w, where w refers to the associativity.
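The layered conflict table and the sum-over-layers rule can be sketched as follows: each hit is credited to the layer equal to its LRU depth within its set, and the hits for associativity w are the sum of layers 1 through w. This is an illustrative reconstruction, not the paper's exact data structure.

```python
# Sketch of the conflict table K_a(b, s): layer a counts hits that occur
# at LRU depth a within a set; hits for associativity w = sum of layers 1..w.
from collections import Counter

def conflict_table(trace, b, s):
    K = Counter()            # layer a -> hit count
    sets = {}                # set index -> MRU-first block list
    for addr in trace:
        block = addr >> b
        stack = sets.setdefault(block % s, [])
        if block in stack:
            K[stack.index(block) + 1] += 1   # hit at this LRU depth
            stack.remove(block)
        stack.insert(0, block)
    return K

def estimated_misses(trace, b, s, w):
    hits = sum(c for a, c in conflict_table(trace, b, s).items() if a <= w)
    return len(trace) - hits

trace = [0, 8, 16, 0, 8, 0, 16]
# estimated_misses(trace, b=3, s=2, w=2) -> 3, matching the earlier
# trace example for a 2-way cache with 2 sets.
```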
26. Algorithm Implementation
27. Experiment Setup
- SPCE is implemented as a standalone C program that processes an instruction-address trace file; instruction-address traces were gathered for 9 arbitrarily chosen benchmarks from Motorola's PowerStone benchmark suite using SimpleScalar.
- Since 64 bytes is the largest block size in the design space utilized, b_max = 3; s_max is defined by the configuration with the maximum number of sets in the design space.
- Performance is examined for the suite of benchmarks with SPCE and also with a very popular trace-driven cache simulator, Dinero IV.
28. Results
- Compare the performance of SPCE and Dinero IV for the 45 cache configurations.
29. Conclusion
- Both the tree-based method and the table-based method (SPCE) ease cache miss-rate estimation and reduce simulation time.
- Compared to Dinero IV, the average speedup is around 30 times.
- Future work includes extending the design-space exploration by considering a second level of cache.