Clive Butler - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Clive Butler


1
Single-pass Cache Optimization
Clive Butler and Ruofan Yang
  • Clive Butler

2
Introduction of Problem
  • Embedded systems execute a single application or a
    class of applications repeatedly
  • An emerging methodology for designing embedded systems
    uses configurable processors
  • Configurable cache parameters: size, associativity,
    and line size.
  • An energy model and an execution time model are
    developed to find the best cache configuration
    for the given embedded application.
  • Current processor design methodologies rely on
    reserving large enough chip area for caches while
    conforming with area, performance, and energy
    cost constraints.
  • Customized cache allows designers to meet tighter
    energy consumption, performance, and cost
    constraints.

3
Introduction of Problem
  • In existing low power processors, cache memory is
    known to consume a large portion of the on-chip
    energy
  • Cache consumes up to 43% to 50% of the total
    system power of a processor.
  • In embedded systems where a single application or
    a class of applications is repeatedly executed
    on a processor, the memory hierarchy could be
    customized such that an optimal configuration is
    achieved.
  • The right choice of cache configuration for a
    given application could have a significant impact
    on overall performance and energy consumption.

4
Introduction of Problem
  • Estimating the hit and miss rates is fairly easy
    using tools such as Dinero.
  • However, doing so for many cache sizes,
    associativities, and line sizes can be enormously
    time consuming.
  • Using Dinero to estimate the cache miss rate for a
    number of cache configurations means that a large
    program trace must be read and evaluated
    repeatedly, which is very time consuming.

5
Dinero
  • Dinero is a trace-driven cache simulator
  • Simulations are repeatable
  • One can simulate either a unified cache (mixed,
    data and instructions cached together) or
    separate instruction and data caches.
  • Cheaper (Hardware)

6
Dinero
  • A din record is a two-tuple: label and address.
  • Cache parameters are set by command line options
  • Labels: 0 = read data, 1 = write data,
    2 = instruction fetch, 3 = escape record,
    4 = escape record (causes cache flush).
  • Dinero uses the priority stack method of memory
    hierarchy simulation to increase flexibility and
    improve simulator performance in highly
    associative caches.
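
A minimal sketch of reading din records follows. It assumes one record per line, with a decimal label followed by a hexadecimal address; the names `DinRecord` and `parse_din` and the sample lines are hypothetical and are not part of Dinero itself.

```python
from typing import Iterable, Iterator, NamedTuple

# Label meanings as listed on the slide.
LABELS = {
    0: "read data",
    1: "write data",
    2: "instruction fetch",
    3: "escape record",
    4: "escape record (causes cache flush)",
}


class DinRecord(NamedTuple):
    label: int
    address: int


def parse_din(lines: Iterable[str]) -> Iterator[DinRecord]:
    """Yield (label, address) records from text lines.

    Assumption: each line is "<label> <hex address>"; blank or
    too-short lines are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        label = int(fields[0])
        if label not in LABELS:
            raise ValueError(f"unknown din label: {label}")
        yield DinRecord(label, int(fields[1], 16))


# Made-up sample trace, for illustration only.
sample = ["2 0", "2 8", "0 1f00", "", "1 1f04"]
records = list(parse_din(sample))
for rec in records:
    print(LABELS[rec.label], hex(rec.address))
```

Because the parser is a generator, a large trace file can be streamed through it line by line instead of being loaded into memory at once.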

7
Introduction: Tree-based Method (Method 1)
  • Presents a methodology to rapidly and accurately
    explore the cache design space
  • Done by estimating cache miss rates for many
    different cache configurations simultaneously
    and investigating the effect of different cache
    configurations on the energy and performance of a
    system.
  • Simultaneous evaluation can be rapidly performed
    by taking advantage of the high correlation
    between cache behavior of different cache
    configurations.

8
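ASP-DAC paper
General Simulation Process
[Figure: the address is divided into a tag and the cache address bits (m(max) .. m(min) .. 0). An array stores tree addresses, each tree node leads to a link list, and a Cache Miss Table records L, N, A, and the # of cache misses.]
  • Step 1: Index
  • Step 2: Go to the tree address and traverse the list
  • Step 3: Find the node and go to its link list
  • Step 4: Look for a match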
9
ASP-DAC paper
Tree example
[Figure: tree built for the address 1010, shown as tag(cache addr.).
  Root: 1010
  Cache Size 2: 101(0), 101(1)
  Cache Size 4: 10(00), 10(01), 10(10), 10(11)
  Cache Size 8: 1(000), 1(001), 1(010), 1(011), 1(100), 1(101), 1(110), 1(111)]
  • Assume each forest has a fixed line size
  • The bits are used to find the path (k)
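
The tag(cache addr.) notation in the tree can be reproduced with a short sketch. It assumes the slide's cache sizes 2, 4, and 8 act as the number of sets at a fixed line size, and that the low-order bits select the set; `split_address` is a hypothetical helper name.

```python
def split_address(addr: int, num_sets: int, addr_bits: int = 4):
    """Return (tag, index) as bit strings for a cache with num_sets sets.

    Assumption: the low-order log2(num_sets) bits are the cache
    address (index) and the remaining high-order bits are the tag.
    """
    if num_sets < 1 or num_sets & (num_sets - 1):
        raise ValueError("num_sets must be a power of two")
    index_bits = num_sets.bit_length() - 1
    tag_bits = addr_bits - index_bits
    index = addr & (num_sets - 1)
    tag = addr >> index_bits
    tag_str = format(tag, f"0{tag_bits}b") if tag_bits > 0 else ""
    index_str = format(index, f"0{index_bits}b") if index_bits > 0 else ""
    return tag_str, index_str


# Path of address 1010 through the tree levels.
paths = {s: split_address(0b1010, s) for s in (1, 2, 4, 8)}
for sets, (tag, index) in paths.items():
    print(f"sets={sets}: {tag}({index})")
```

Each deeper level of the tree consumes one more address bit, which is why a single walk from the root visits the matching node for every cache size.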
10
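ASP-DAC paper
Link list set associative
[Figure: a link list ordered from the most recently used element to the least recently used element, with markers for Assoc. 1, Assoc. 2, and Assoc. 4; lookups are labeled Hit, Hit, Hit, Miss.]
Table for Miss Count
L   N   A   # of Cache Miss
1   4   1   0
1   4   1   1
  • The rest of the address is used as the tag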
11
ASP-DAC paper
Link List LRU update
[Figure: the link list after the LRU update, ordered from the most recently used element to the least recently used element, with markers for Assoc. 1, Assoc. 2, and Assoc. 4.]
Table for Miss Count
L   N   A   # of Cache Miss
1   4   1   0
1   4   1   1
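
One way to read the two link list slides is that a single LRU-ordered list can serve several associativities at once. The sketch below illustrates that idea for one set only. It is an interpretation of the figures, not the paper's code, and `misses_per_associativity` is a hypothetical name.

```python
def misses_per_associativity(blocks, associativities=(1, 2, 4)):
    """Count misses for several associativities using one LRU list.

    The list models a single cache set: index 0 is the most recently
    used element, the last index is the least recently used one.
    """
    max_assoc = max(associativities)
    lru = []
    misses = {a: 0 for a in associativities}
    for block in blocks:
        depth = lru.index(block) if block in lru else None
        for a in associativities:
            # An a-way set holds exactly the first a elements of the list.
            if depth is None or depth >= a:
                misses[a] += 1
        if depth is not None:
            lru.pop(depth)
        lru.insert(0, block)   # LRU update: move (or add) to the front
        del lru[max_assoc:]    # nothing deeper than the largest associativity matters
    return misses


# Block numbers 0, 1, 2 stand for the addresses 0, 8, 16 of the trace example.
result = misses_per_associativity([0, 1, 2, 0, 1, 0, 2])
print(result)
```

For this sequence the counts for associativity 1 and 2 agree with the M = 1 rows of the miss count tables in the detail trace example (7 and 6 misses).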
12
Detail Trace Example
Example Specifications
  • Cache Size (N) will vary from 32 bits max to 2
    bits min
  • Associativity (A) will vary from 4 max to 1 min
  • Cache Set Size (M) will vary from 8 max to 1 min
  • Assume a fixed line size (L)

13
Detail Trace Example
Instruction Trace (k -> m)
  1. 000000 -> 0
  2. 001000 -> 8
  3. 010000 -> 16
  4. 000000 -> 0
  5. 001000 -> 8
  6. 000000 -> 0
  7. 010000 -> 16
Assoc. 1
L   N   M   Miss Count
1   8   8   3
1   4   4   3
1   2   2   5
1   1   1   7
[Figure: contents of the link lists for M1, M2 (sets 0, 1), M4 (sets 00 to 11), and M8 (sets 000 to 111) as the trace 0, 8, 16, 0, 8, 0, 16 is processed.]
14
Detail Trace Example
Instruction Trace (k -> m)
  1. 000000 -> 0
  2. 001000 -> 8
  3. 010000 -> 16
  4. 000000 -> 0
  5. 001000 -> 8
  6. 000000 -> 0
  7. 010000 -> 16
Assoc. 2
L   N   M   Miss Count
1   16  8   3
1   8   4   3
1   4   2   3
1   2   1   6
[Figure: contents of the link lists for M1, M2 (sets 0, 1), M4 (sets 00 to 11), and M8 (sets 000 to 111) as the trace 0, 8, 16, 0, 8, 0, 16 is processed.]
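
As a cross-check, the miss counts in the Assoc. 1 and Assoc. 2 tables can be reproduced by simulating each configuration separately, the slow one-configuration-at-a-time approach that the single-pass methods aim to avoid. The sketch assumes LRU replacement and that the three low-order address bits are a block offset. The offset is an inference from the tabulated counts rather than something the slides state, and `count_misses` is a hypothetical name.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, assoc, offset_bits=3):
    """Simulate one LRU set-associative cache and return its miss count.

    Assumption: offset_bits=3, so addresses 0, 8, 16 are blocks 0, 1, 2.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr >> offset_bits
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)   # evict the least recently used
            lines[block] = True
    return misses


trace = [0, 8, 16, 0, 8, 0, 16]
table = {(m, a): count_misses(trace, m, a)
         for a in (1, 2) for m in (8, 4, 2, 1)}
for (m, a), n_miss in table.items():
    print(f"A={a} M={m} N={m * a} misses={n_miss}")
```

Each call walks the whole trace again, so the cost grows with the number of configurations; that is the overhead the tree-based and table-based methods remove.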
15
ASP-DAC Results
  • Using benchmarks from Mediabench
  • This method is on average 45 times faster than
    Dinero IV at exploring the design space,
  • while still having 100% accuracy.

16
Introduction: Table-based Method
  • Two cache evaluation techniques, analytical
    modeling and execution-based evaluation, are
    used to evaluate the design space
  • SPCE presents a simplified yet efficient way to
    extract locality properties for an entire cache
    configuration design space in a single pass
  • Covers related work, an overview of SPCE,
    properties for addressing behavior analysis to
    estimate the cache miss rate, the experiment, and
    the results

17
Related Work
  • Much existing research in this area needs multiple
    passes to explore all configurable parameters or
    employs large and complex data structures, which
    restricts its applicability
  • Algorithms for single-pass cache simulation
    examine a set of caches concurrently (Mattson;
    Hill and Smith; Sugumar and Abraham; Cascaval and
    Padua)
  • Janapsatya et al. present a technique to evaluate
    all different cache parameters simultaneously,
    but it is not designed with a hardware
    implementation in mind
  • This paper's methodology uses simple array
    structures, which are more amenable to a
    light-weight hardware implementation

18
SPCE Overview
19
Definitions
  • Time-ordered sequence of referenced addresses:
    T[t] (t is a positive integer), of length |T|,
    such that T[t] is the t-th address referenced
  • If T[t_i] >> b = T[t_i + d] >> b, then the
    addresses T[t_i] and T[t_i + d] are references to
    the same cache block of 2^b words
  • Define d as the delay, or the number of unique
    cache references occurring between any two
    references where T[t_i] >> b = T[t_i + d] >> b

20
Definitions
  • Evaluate the locality in the sequence of
    addresses T[t_i] of a running application a_i by
    counting the occurrences where
    T[t_i] >> b = T[t_i + d] >> b and registering
    each one in the cell L(b, d) of the locality
    table (2^b is the block size, d is the delay)

21
Fully-Associative
  • A fully-associative cache configuration is
    defined by the notation c_j(b, n), where b
    defines the line size in terms of words, and n
    the total number of lines in the cache
  • The locality table L(b, d) provides an efficient
    way to estimate the cache miss rate of
    fully-associative caches

22
Fully-Associative Example
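Locality table for the trace T
A sequence of addresses:
t    T    T >> b0
t0   0    0
t1   8    1
t2   16   2
t3   0    0
t4   8    1
t5   0    0
t6   16   2
Locality table:
d   b0
1   0
2   1
3   3
4   0
[Figure annotations: d = 3 and d = 2 mark example delays between repeated references in the T >> b sequence.]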
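
The locality table for this trace can be rebuilt with a small sketch. It assumes that d is counted as the 1-based position of the block in an LRU stack (the number of unique intervening blocks plus one), which is what the tabulated values suggest, and that a fully-associative cache with n lines hits exactly the references with d <= n. The function names and the choice b = 3 are illustrative assumptions.

```python
from collections import Counter


def locality_table(addresses, b):
    """Return a Counter mapping delay d to its count: one column of L(b, d)."""
    stack = []          # block numbers, most recently used first
    table = Counter()
    for addr in addresses:
        block = addr >> b
        if block in stack:
            d = stack.index(block) + 1   # assumed 1-based delay
            table[d] += 1
            stack.remove(block)
        stack.insert(0, block)
    return table


def fully_associative_misses(addresses, b, n):
    """Estimate misses of a fully-associative LRU cache with n lines."""
    table = locality_table(addresses, b)
    hits = sum(count for d, count in table.items() if d <= n)
    return len(addresses) - hits


trace = [0, 8, 16, 0, 8, 0, 16]
L = locality_table(trace, b=3)
print([(d, L[d]) for d in range(1, 5)])
print(fully_associative_misses(trace, b=3, n=2))
```

One pass over the trace fills the column, after which the miss count for any number of lines n is a simple sum, with no need to re-read the trace.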
23
Set-Associative
  • Most real-world cache devices are built as
    direct-map or set-associative structures
  • Because of mapping conflicts, L cannot be used to
    estimate misses, so define s as the number of
    sets, independent of the associativity; for a
    direct-mapped cache, the set size is 1 and s = n
  • To analyze the cache conflicts, we build the
    conflict table K_a(b, s) (2^b is the block size,
    s is the number of sets), which is composed of a
    layers, one for each associativity explored

24
Set-Associative
25
Set-Associative
  • The value stored in each element of the table
    K_a(b, s) indicates how many times the same block
    (size 2^b) is repeatedly referenced and results
    in a hit.
  • A given cache configuration with level of
    associativity w is capable of overcoming no more
    than w - 1 mapping conflicts.
  • The number of cache hits is determined by summing
    up the cache hits from layer a = 1 up to its
    respective layer a = w, where w refers to the
    associativity.
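
The layer summation can be illustrated with a simplified software sketch. It is not the paper's implementation: it assumes that layer a counts re-references separated by a - 1 conflicting blocks in the same set, and it keeps one LRU stack per set for every set count s. The names `conflict_table` and `hits` are hypothetical.

```python
from collections import defaultdict


def conflict_table(addresses, b, set_counts, max_assoc):
    """Return K[a][s]: hits that need associativity a when there are s sets."""
    K = {a: {s: 0 for s in set_counts} for a in range(1, max_assoc + 1)}
    for s in set_counts:
        stacks = defaultdict(list)   # per-set LRU stack, most recent first
        for addr in addresses:
            block = addr >> b
            stack = stacks[block % s]
            if block in stack:
                a = stack.index(block) + 1   # conflicts since last use, plus one
                if a <= max_assoc:
                    K[a][s] += 1
                stack.remove(block)
            stack.insert(0, block)
    return K


def hits(K, s, w):
    """Sum the layers a = 1 .. w for a w-way cache with s sets."""
    return sum(K[a][s] for a in range(1, w + 1))


trace = [0, 8, 16, 0, 8, 0, 16]
K = conflict_table(trace, b=3, set_counts=(1, 2, 4, 8), max_assoc=4)
for s in (8, 4, 2, 1):
    print(s, [len(trace) - hits(K, s, w) for w in (1, 2)])
```

With b = 3, the derived miss counts for this trace match the Assoc. 1 and Assoc. 2 tables of the detail trace example, which is a useful sanity check on the layer interpretation.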

26
Algorithm Implementation
27
Experiment Setup
  • SPCE is implemented as a standalone C program that
    processes an instruction address trace file.
    Instruction address traces were gathered for 9
    arbitrarily chosen benchmarks from Motorola's
    PowerStone benchmark suite using SimpleScalar
  • Since 64 bytes is the largest block size in the
    design space utilized, b_max = 3; s_max is defined
    by the configuration with the maximum number of
    sets in the design space
  • Examine performance for our suite of benchmarks
    with SPCE and also with a very popular
    trace-driven cache simulator (DineroIV)

28
Results
  • Compare performance of SPCE and DineroIV for the
    45 cache configurations.

29
Conclusion
  • Both the tree-based method and the table-based
    method (SPCE) ease cache miss rate estimation
    and reduce simulation time.
  • Compared to the DineroIV method, the average
    speedup is around 30 times.
  • Our future work includes extending the design
    space exploration by considering a second
    level of cache.