Clive Butler - PowerPoint PPT Presentation

About This Presentation

Title:

Clive Butler

Slides: 30

Provided by: annEc6

Learn more at: http://www.ann.ece.ufl.edu

Transcript and Presenter's Notes

Title: Clive Butler

1
Single-pass Cache Optimization
Clive Butler and Ruofan Yang

2
Introduction of Problem

Embedded systems execute a single application or a
class of applications repeatedly.
An emerging methodology for designing embedded systems
utilizes configurable processors.
Configurable cache parameters: size, associativity, and line size.
An energy model and an execution time model are
developed to find the best cache configuration
for the given embedded application.
Current processor design methodologies rely on
reserving a large enough chip area for caches while
conforming to area, performance, and energy
cost constraints.
A customized cache allows designers to meet tighter
energy consumption, performance, and cost
constraints.

3
Introduction of Problem

In existing low-power processors, cache memory is
known to consume a large portion of the on-chip
energy.
Cache can consume 43% to 50% of the total
system power of a processor.
In embedded systems where a single application or
a class of applications is repeatedly executed
on a processor, the memory hierarchy can be
customized so that an optimal configuration is
achieved.
The right choice of cache configuration for a
given application can have a significant impact
on overall performance and energy consumption.

4
Introduction of Problem

Estimating the hit and miss rates is fairly easy
using tools such as Dinero.
However, it can be enormously time consuming to do so for
various cache sizes, associativities, and line
sizes.
Using Dinero to estimate the cache miss rate for a
number of cache configurations means that a large
program trace must be read and evaluated repeatedly.

5
Dinero

Dinero is a trace-driven cache simulator
Simulations are repeatable
One can simulate either a unified cache (mixed,
data and instructions cached together) or
separate instruction and data caches.
Cheaper (Hardware)

6
Dinero

A din record is a two-tuple: label, address.
Label values: 0 read data, 1 write data, 2 instruction fetch,
3 escape record, 4 escape record (causes cache
flush).
Cache parameters are set by command-line options.
Dinero uses the priority stack method of memory
hierarchy simulation to increase flexibility and
improve simulator performance in highly
associative caches.
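
As an illustration of the din record layout described above, here is a minimal Python sketch that reads records into (label, address) pairs. It assumes the address field is written in hexadecimal, as in the classic din format; the helper name `parse_din` and the sample records are hypothetical and not taken from the slides.

```python
from collections import Counter

# Label meanings as listed on the slide.
LABELS = {
    0: "read data",
    1: "write data",
    2: "instruction fetch",
    3: "escape record",
    4: "escape record (causes cache flush)",
}

def parse_din(lines):
    """Turn din text lines into (label, address) tuples.

    Assumption: each line is "<label> <hex address>"; shorter lines are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        label = int(fields[0])
        address = int(fields[1], 16)
        records.append((label, address))
    return records

# Made-up sample trace, for illustration only.
sample = ["2 0", "2 8", "0 1000", "1 1004", "2 10"]
records = parse_din(sample)
counts = Counter(LABELS[label] for label, _ in records)
print(records)
print(counts)
```

Reading the trace once into memory like this is what makes a single-pass evaluation possible; a per-configuration simulator would instead re-read the same records for every cache configuration.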

7
Introduction: Method 1, Tree-based Method

Presents a methodology to rapidly and accurately
explore the cache design space.
This is done by estimating cache miss rates for many
different cache configurations simultaneously
and investigating the effect of different cache
configurations on the energy and performance of a
system.
Simultaneous evaluation can be performed rapidly
by taking advantage of the high correlation
between the cache behavior of different cache
configurations.

8
ASP-DAC paper
General Simulation Process

[Figure: the cache address is split into a tag and index bits m(max)..m(min)..0; an array stores tree addresses, each tree node leads to a linked list, and a Cache Miss Table records L, N, A, and # of cache misses.]
Step 1: Index.
Step 2: Go to the tree address and traverse the list.
Step 3: Find the node and go to its linked list.
Step 4: Look for a match.
9
ASP-DAC paper
Tree example

[Figure: tree for the cache address 1010; the bits in parentheses select the node, the remaining bits form the tag.]
Cache Size 2: 101(0), 101(1)
Cache Size 4: 10(00), 10(01), 10(10), 10(11)
Cache Size 8: 1(000), 1(001), 1(010), 1(011), 1(100), 1(101), 1(110), 1(111)
Assume each forest has a fixed line size.
The bits (k) are used to find the path.
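
The tree example can be sketched in a few lines of Python. This is only an illustration of the bit split shown on the slide, assuming the low-order k bits pick the node at each tree level and the remaining high-order bits are kept as the tag; the function name `tree_path` is hypothetical.

```python
def tree_path(address, max_index_bits):
    """Return (cache size, tag bits, index bits) for each tree level.

    Assumption: level k uses the k low-order address bits as the index
    and the remaining bits as the tag.
    """
    path = []
    for k in range(1, max_index_bits + 1):
        index = address & ((1 << k) - 1)   # low-order k bits
        tag = address >> k                 # rest of the address
        path.append((1 << k, format(tag, "b"), format(index, f"0{k}b")))
    return path

for size, tag, index in tree_path(0b1010, 3):
    print(f"Cache Size {size}: {tag}({index})")
```

For the address 1010 this walks the nodes 101(0), 10(10), and 1(010), which are the three nodes on the slide that lie on the path for that address.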
10
ASP-DAC paper
Link list set associative

[Figure: linked list ordered from the most recently used element to the least recently used element, with markers for Assoc. 1, Assoc. 2, and Assoc. 4 and the labels Hit, Hit, Hit, Miss.]
Table for Miss Count:
L  N  A  # of Cache Miss
1  4  1  0
1  4  1  1
The rest of the address is used as the tag.
11
ASP-DAC paper
Link List LRU update

[Figure: linked list ordered from the most recently used element to the least recently used element, with markers for Assoc. 1, Assoc. 2, and Assoc. 4.]
Table for Miss Count:
L  N  A  # of Cache Miss
1  4  1  0
1  4  1  1
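
The linked-list lookup and LRU update can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes one list per set, ordered from most to least recently used, and relies on the LRU property that a cache of associativity a holds exactly the first a elements of that list. The function name `access` and the tags A, B, C are hypothetical.

```python
def access(lru_list, tag, associativities=(1, 2, 4)):
    """Look up `tag` in one set's LRU list, update the list, and report
    hit (True) or miss (False) for each associativity."""
    if tag in lru_list:
        depth = lru_list.index(tag)   # 0 means most recently used
        lru_list.remove(tag)
    else:
        depth = None                  # not present: miss everywhere
    lru_list.insert(0, tag)           # the tag becomes most recently used
    del lru_list[max(associativities):]  # drop elements beyond the largest cache
    return {a: depth is not None and depth < a for a in associativities}

lru = []
results = [access(lru, tag) for tag in ["A", "B", "A", "C", "B"]]
for tag, outcome in zip(["A", "B", "A", "C", "B"], results):
    print(tag, outcome)
```

One list search therefore answers the hit-or-miss question for all three associativities at once, which is where the single-pass saving comes from.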
12
Detail Trace Example
Example Specifications

Cache Size (N) will vary from 32 bits max to 2
bits min.
Associativity (A) will vary from 4 max to 1 min.
Cache Set Size (M) will vary from 8 max to 1 min.
Assume a fixed line size (L).

13
Detail Trace Example
Instruction Trace (k, m):
1. 000000 -> 0
2. 001000 -> 8
3. 010000 -> 16
4. 000000 -> 0
5. 001000 -> 8
6. 000000 -> 0
7. 010000 -> 16

Assoc. 1
L  N  M  Miss Count
1  8  8  3
1  4  4  3
1  2  2  5
1  1  1  7

[Figure: trees M1, M2 (nodes 0, 1), M4 (nodes 00 to 11), and M8 (nodes 000 to 111), each annotated with the referenced addresses 0, 8, 16, 0, 8, 0, 16.]
14
Detail Trace Example
Instruction Trace (k, m):
1. 000000 -> 0
2. 001000 -> 8
3. 010000 -> 16
4. 000000 -> 0
5. 001000 -> 8
6. 000000 -> 0
7. 010000 -> 16

Assoc. 2
L  N  M  Miss Count
1  16  8  3
1  8   4  3
1  4   2  3
1  2   1  6

[Figure: trees M1, M2 (nodes 0, 1), M4 (nodes 00 to 11), and M8 (nodes 000 to 111), with the linked lists for the referenced addresses 0, 8, 16.]
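
The miss counts in the two trace-example tables can be cross-checked with a plain LRU simulation, one configuration at a time. This is a brute-force reference check, not the tree-based method itself. It assumes the three low-order address bits are the line offset, so the addresses 0, 8, and 16 become blocks 0, 1, and 2; that assumption is inferred from the tables rather than stated on the slides, and `count_misses` is a hypothetical helper.

```python
from collections import OrderedDict

def count_misses(blocks, num_sets, assoc):
    """Simulate an LRU set-associative cache and count the misses."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in blocks:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == assoc:
                lines.popitem(last=False)  # evict the least recently used line
            lines[block] = True
    return misses

trace = [0, 8, 16, 0, 8, 0, 16]
blocks = [addr >> 3 for addr in trace]     # assumption: 3 offset bits

table = {(a, m): count_misses(blocks, m, a)
         for a in (1, 2) for m in (8, 4, 2, 1)}
for (a, m), misses in table.items():
    print(f"Assoc. {a}, M = {m}: {misses} misses")
```

Under that assumption the simulation gives 3, 3, 5, 7 misses for Assoc. 1 and 3, 3, 3, 6 for Assoc. 2 as M goes from 8 down to 1, matching both tables. Each configuration here needs its own pass over the trace, which is the cost the single-pass methods avoid.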
15
ASP-DAC Results

Using benchmarks from Mediabench, this method is on
average 45 times faster than Dinero IV at exploring
the design space, while still having 100% accuracy.

16
Introduction: Table-based Method

Two cache evaluation techniques, analytical
modeling and execution-based evaluation, are used
to evaluate the design space.
SPCE presents a simplified, yet efficient way to
extract locality properties for an entire cache
configuration design space in just a
single pass.
The following slides cover related work, an overview of SPCE,
properties for addressing behavior analysis to
estimate the cache miss rate, the experiment, and the
results.

17
Related Work

Much of the existing research in this area needs multiple
passes to explore all configurable parameters or
employs large and complex data structures, which
restricts its applicability.
Algorithms for single-pass cache simulation examine
a set of caches concurrently: Mattson; Hill and
Smith; Sugumar and Abraham; Cascaval and Padua.
Janapsatya et al. present a technique to evaluate
all different cache parameters simultaneously,
but it was not designed with a hardware implementation
in mind.
This paper's methodology uses simple array
structures, which are more amenable to a
light-weight hardware implementation.

18
SPCE Overview
19
Definitions

Time-ordered sequence of referenced addresses:
T[t] (t is a positive integer), of length |T|, such
that T[t] is the t-th address referenced.
If T[t_i] >> b = T[t_i + d] >> b, then the
addresses T[t_i] and T[t_i + d] are references to
the same cache block of 2^b words.
Define d as the delay, or the number of unique
cache references occurring between any two
references where T[t_i] >> b = T[t_i + d] >> b.

20
Definitions

Evaluate the locality in the sequence of
addresses T[t_i] of a running application a_i by
counting the occurrences where
T[t_i] >> b = T[t_i + d] >> b and registering each in
the cell L(b, d) of the locality table (2^b is the
block size, d is the delay).
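
One way to build a row of the locality table is sketched below in Python. It is an illustration under a stated assumption: d is counted as the LRU stack distance, meaning the number of unique blocks referenced since the previous reference to the same block, plus one. That reading is chosen because it reproduces the locality table in this presentation's fully-associative example; the function name `locality_table` is hypothetical.

```python
from collections import Counter

def locality_table(trace, b):
    """Return {d: count} for block size 2**b over one pass of the trace.

    Assumption: d is the LRU stack distance of each repeated reference.
    """
    stack = []            # most recently used block first
    table = Counter()
    for address in trace:
        block = address >> b
        if block in stack:
            d = stack.index(block) + 1   # unique blocks in between, plus one
            table[d] += 1
            stack.remove(block)
        stack.insert(0, block)
    return table

# Block sequence 0, 1, 2, 0, 1, 0, 2 with b = 0.
L0 = locality_table([0, 1, 2, 0, 1, 0, 2], b=0)
print(dict(L0))
```

For this sequence the sketch registers one occurrence at d = 2 and three at d = 3. First-time references have no earlier match and are not registered in any cell.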

21
Fully-Associative

A fully-associative cache configuration is
defined by the notation c_j(b, n), where b
defines the line size in terms of words, and n
the total number of lines in the cache.
The locality table L(b, d) provides an efficient
way to estimate the cache miss rate of
fully-associative caches.

22
Fully-Associative Example

A sequence of addresses (trace T):
t    T    T (b = 0)
t0   0    0
t1   8    1
t2   16   2
t3   0    0
t4   8    1
t5   0    0
t6   16   2

Locality table for the trace T:
d    b = 0
1    0
2    1
3    3
4    0

[Figure annotations: d = 3 and d = 2 mark example delays in the trace.]
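
A short sketch of how the locality table yields fully-associative miss counts: a reference registered at delay d hits in a cache of n lines when d <= n, so the hits are the sum of the cells up to n. This assumes d counts the referenced block itself (LRU stack distance), which is an interpretation rather than something the slides state outright. The row below is copied from the example's locality table for b = 0, and the trace length is 7; `fully_associative_misses` is a hypothetical name.

```python
def fully_associative_misses(locality_row, total_refs, n):
    """Estimate misses of a fully-associative LRU cache with n lines.

    locality_row maps delay d to the number of references registered at d.
    Assumption: a reference at delay d hits when d <= n.
    """
    hits = sum(count for d, count in locality_row.items() if d <= n)
    return total_refs - hits

row_b0 = {1: 0, 2: 1, 3: 3, 4: 0}   # locality table column for b = 0
misses = {n: fully_associative_misses(row_b0, 7, n) for n in (1, 2, 3, 4)}
print(misses)
```

With this row the estimate is 7 misses for one line, 6 for two lines, and 3 for three or four lines, all obtained from the same table without re-reading the trace.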
23
Set-Associative

Most real-world cache devices are built as
direct-mapped or set-associative structures.
Because of conflicts, L cannot be used to estimate
misses, so define s as the number of sets,
independent of the associativity; for a
direct-mapped cache the set size is 1 and s = n.
To analyze the cache conflicts, we build the conflict
table K_a(b, s) (2^b is the block size, s is the number of sets), which
is composed of a layers, one for each
associativity explored.

24
Set-Associative
25
Set-Associative

The value stored in each element of the table
K_a(b, s) indicates how many times the same block
(size 2^b) is repeatedly referenced and results
in a hit.
A given cache configuration with level of
associativity w is capable of overcoming no more
than w - 1 mapping conflicts.
The number of cache hits is determined by summing
up the cache hits from layer a = 1 up to its
respective layer a = w, where w refers to the
associativity.
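
The layer-summing rule can be illustrated with a small Python sketch. It is one plausible construction of the conflict table, not the paper's algorithm: for each number of sets s it keeps an LRU list per set and registers a repeated reference in layer a when a - 1 other blocks that map to the same set were referenced since its last use. The block size is held fixed, so the b dimension is omitted, and the names `conflict_table` and `hits` are hypothetical.

```python
from collections import defaultdict

def conflict_table(blocks, set_counts, max_assoc):
    """Build K[(a, s)]: repeated references that overcome a - 1 conflicts."""
    K = defaultdict(int)
    for s in set_counts:
        stacks = defaultdict(list)         # one LRU list per set
        for block in blocks:
            stack = stacks[block % s]
            if block in stack:
                a = stack.index(block) + 1   # conflicts since last use, plus one
                if a <= max_assoc:
                    K[(a, s)] += 1
                stack.remove(block)
            stack.insert(0, block)
    return K

def hits(K, s, w):
    """Sum the layers a = 1 .. w for a cache with s sets, associativity w."""
    return sum(K[(a, s)] for a in range(1, w + 1))

blocks = [0, 1, 2, 0, 1, 0, 2]             # block sequence of the example trace
K = conflict_table(blocks, set_counts=(1, 2, 4, 8), max_assoc=4)
for w in (1, 2):
    print(w, [len(blocks) - hits(K, s, w) for s in (8, 4, 2, 1)])
```

For the example block sequence this gives 3, 3, 5, 7 misses for w = 1 and 3, 3, 3, 6 for w = 2 as s goes from 8 down to 1, the same counts as the direct-mapped and two-way tables in the detailed trace example, which suggests the layering is consistent with the earlier slides.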

26
Algorithm Implementation
27
Experiment Setup

Implemented SPCE as a standalone C program to
process an instruction address trace file.
Gathered instruction address traces for 9
arbitrarily chosen benchmarks from Motorola's Powerstone
benchmark suite using SimpleScalar.
Since 64 bytes is the largest block size in the
design space utilized, b_max = 3; s_max is defined by the
configuration with the maximum number of sets in
the design space.
Examined performance for our suite of benchmarks with
SPCE and also with a very popular trace-driven
cache simulator (DineroIV).

28
Results

Compare performance of SPCE and DineroIV for the
45 cache configurations.

29
Conclusion

Both the tree-based method and the table-based method
(SPCE) make cache miss rate estimation easier
and reduce simulation
time.
Compared to DineroIV, the average speedup
is around 30 times.
Our future work includes extending the design
space exploration by considering a second
level of cache.
