2K papers on caches by Y2K: Do we need more?

Learn more at: https://homes.cs.washington.edu

1
2K papers on caches by Y2K: Do we need more?

Jean-Loup Baer
Dept. of Computer Science & Engineering
University of Washington

4
A little bit of history

The Y0K problem
The Y1K problem
For the French version: who was King of France
in the year 1000?

5
Outline

More history
Anthology
Challenges
Conclusion

6
More history

Caches introduced (commercially) more than 30
years ago in the IBM 360/85
already a processor-memory gap
Oblivious to the ISA
caches were organization, not architecture
Sector caches
to minimize tag area
Single level off-chip

7
Terminology

One of the original designers (Gibson) had first
coined the name "muffer"
When papers were submitted, the authors (Conti,
Gibson, Liptay, Pitkovsky) used the term
"high-speed buffer"
The EIC of IBM Systems Journal (R. L. Johnson)
suggested a more sexy name, namely "cache", after
consulting a thesaurus

8
Today

Caches are ubiquitous
On-chip, off-chip
But also, disk caches, web caches, trace caches
etc.
Multilevel cache hierarchy
With inclusion or exclusion
Many different organizations
direct-mapped, set-associative,
skewed-associative, sector, decoupled sector etc.

9
Today (cont'd)

Cache exposed to the ISA
Prefetch, Fence, Purge etc.
Cache exposed to the compiler
Code and data placement
Cache exposed to the O.S.
Page coloring
Many different write policies
copy-back, write-through, fetch-on-write,
write-around, write-allocate etc.

10
Today (cont'd)

Numerous cache assists, for example
For storage: write buffers, victim caches,
temporal/spatial caches
For overlap: lock-up free caches
For latency reduction: prefetch
For better cache utilization: bypass mechanisms,
dynamic line sizes
etc.

11
Caches and Parallelism

Cache coherence
Directory schemes
Snoopy protocols
Synchronization
Test-and-test-and-set
load linked -- store conditional
Models of memory consistency

12
When were the 2K papers being written?

A few facts
1980 textbook
1996 textbook: 120 pages on caches (20%)
Smith survey (1982)
About 40 references on caches
Uhlig and Mudge survey on trace-driven simulation
(1997)
About 25 references specific to cache performance
only
Many more on tools for performance etc.

13
Cache research vs. time
[Figure: cache research vs. time. Chart annotations: largest number (14); 1st session on caches]
14
Outline

More history
Anthology
Challenges
Conclusion

15
Some key papers - Cache Organization

Conti (Computer 1969): direct-mapped (cf. "slave
memory" and tags in Wilkes 1965),
set-associativity
Bell et al. (IEEE TC 1974): cache design for small
machines (advocated unified caches; pipelining
nullified that)
Hill (Computer 1988): the case for direct-mapped
caches (technology has made the case obsolete)
Smith (Computing Surveys 1982): virtual vs.
physical addressing (first cogent discussion)

16
Some key papers - Qualitative Properties

Smith (Computing Surveys 1982): spatial and
temporal locality
Hill (Ph.D. 1987): the three Cs
Baer and Wang (ISCA 1988): multi-level inclusion

17
Some key papers - Cache Evaluation Methodology

Belady (IBM Systems J. 1966): MIN and OPT
Mattson et al. (IBM Systems J. 1970): the stack
property
Trace collection
Hardware: Clark (ACM TOCS 1983)
Microcode: Agarwal, Sites and Horowitz (ISCA
1986), ATUM
Software: M. Smith (1991), Pixie
Very long traces: Borg, Kessler and Wall (ISCA
1990)
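
The stack property mentioned on this slide is what lets a single pass over a trace evaluate many cache sizes at once: under LRU, the contents of a smaller cache are always a subset of those of a larger one. A minimal sketch, assuming a fully associative LRU cache and a toy trace of block names (both the trace and the function names are illustrative, not taken from the papers cited):

```python
from collections import Counter

def lru_stack_distances(trace):
    """Return the LRU stack distance of each reference (None for a first touch)."""
    stack = []                           # most recently used block at index 0
    distances = []
    for block in trace:
        if block in stack:
            depth = stack.index(block)   # 0-based position in the LRU stack
            distances.append(depth + 1)  # 1-based stack distance
            stack.pop(depth)
        else:
            distances.append(None)       # cold (compulsory) miss
        stack.insert(0, block)
    return distances

def misses_for_all_sizes(trace, max_lines):
    """Miss counts of fully associative LRU caches of 1..max_lines lines, one pass."""
    hist = Counter(lru_stack_distances(trace))
    total = len(trace)
    misses = {}
    hits = 0
    for size in range(1, max_lines + 1):
        hits += hist.get(size, 0)        # a reference hits iff its distance <= size
        misses[size] = total - hits
    return misses

trace = ["a", "b", "c", "a", "b", "d", "a"]   # hypothetical block trace
print(misses_for_all_sizes(trace, 4))          # {1: 7, 2: 7, 3: 4, 4: 4}
```

The list-based stack keeps the sketch short; it costs time proportional to the stack depth per reference, which is fine for illustration but not for very long traces.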

18
Some key papers - Cache Performance

Kaplan and Winder (Computer 1973): 8 to 16K
caches with block sizes of 64 to 128 bytes and
set-associativity 2 or 4 will yield hit ratios of
over 95%
Strecker (ISCA 1976): design of the PDP 11/70 --
2KB, 2-way set-associative, 4 byte (2 words)
block size
Smith (Computing Surveys 1982): most comprehensive
study of the time: prefetching, replacement,
associativity, line size etc.
Przybylski et al. (ISCA 1988): comprehensive
study 6 years later
Woo et al. (ISCA 1995): Splash-2

19
Some key papers - Cache Assists

IBM ??: write buffers
Gindele (IBM TD Bull 1977): OBL prefetch (OBL
coined by Smith?)
Kroft (ISCA 1981): lock-up free caches
Jouppi (ISCA 1990): victim caches, stream buffers
Pettis and Hansen (PLDI 1990): code placement

20
Some key papers - Cache Coherence

Censier and Feautrier (IEEE TC 1978): directory
scheme
Goodman (ISCA 1983): the first snoopy protocol
Archibald and Baer (TOCS 1986): snoopy
terminology
Dubois, Scheurich and Briggs (ISCA 1986): memory
consistency

21
Outline

More history
Anthology
Challenges
Conclusion

22
Caches are great. Yes, but ...

Caches are poorly utilized
Lots of dead lines (only 20% efficiency, Burger
et al. 1995)
Squandering of memory bandwidth
The memory wall
At the limit, it will take longer to load a
program on-chip than to execute it (Wulf and
McKee 1995)

23
Solution Paradigms

Revolution
Evolution
Enhancements

24
Revolution
25
Evolution (processor in memory / application
specific)

IRAM (Patterson et al. 1997)
Vector processor; data stream apps; low power
FlexRAM (Torrellas et al. 1999)
Memory chip: simple multiprocessor, superscalar,
banks of DRAM; memory-intensive apps
Active Pages (Chong et al. 1998)
Co-processor paradigm; reconfigurable logic in
memory; apps such as scatter-gather
FBRAM (Deering et al. 1994)
Graphics in memory

26
Enhancements

Hardware and software cache assists
Examples: hardware tables; most common case
resolved in hardware, less common in software
Use real estate on-chip to provide intelligence
for managing on-chip and off-chip hierarchy
Examples: memory controller, prefetch engines for
L2 on processor chip

27
General Approach

Identify a cache parameter/enhancement whose
tuning will lead to better performance
Assess potential margin of improvement
Propose and design an assist
Measure efficiency of the scheme

28
Identify a cache parameter/enhancement

The creative part!
Our current projects
Dynamic line sizes
Modified LRU policies using detection of temporal
locality
Prefetching in L2

29
Assess potential margin of improvement

Metrics?
Miss rate, bandwidth, average memory access time
Weighted combination of some of the above
Execution time
Compare to optimal (off-line) algorithm
Easy for replacement algorithms
OK for some other metrics (e.g., cost of a
cache miss depending on line size; oracle for
prefetching)
Hard for execution time
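
Of the metrics listed above, average memory access time is the one with a standard closed form: hit time plus miss rate times miss penalty, applied recursively down the hierarchy. A minimal sketch; the function names and the cycle counts in the example are assumptions chosen for illustration, not figures from the talk:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for one level, in the units of hit_time."""
    return hit_time + miss_rate * miss_penalty

def two_level_amat(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, mem_penalty):
    """AMAT of an L1/L2 hierarchy: the L1 miss penalty is the AMAT of L2."""
    l2_amat = amat(l2_hit, l2_local_miss_rate, mem_penalty)
    return amat(l1_hit, l1_miss_rate, l2_amat)

# Hypothetical parameters: 1-cycle L1 hit, 5% L1 miss rate,
# 10-cycle L2 hit, 20% local L2 miss rate, 100-cycle memory penalty.
print(two_level_amat(1, 0.05, 10, 0.20, 100))
```

With these numbers the L2 level averages 30 cycles and the hierarchy works out to 2.5 cycles per access. The formula ignores overlap between misses and computation, which is one reason the slide calls execution time the hard case.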

30
Measure efficiency of the scheme

Same problem: metrics?
The further from the processor, the more
relaxed the metric
For L1-L2, you need to see impact on execution
speed
For L2-DRAM, you can get away with average
memory access time

31
Anatomy of a Predictor
Exec.
Event selec.
Pred. Index.
Recovery?
Pred. Mechan.
Feedback
32
Anatomy of a Cache Predictor
Exec.
Event selec.
Pred. Index.
Pred. Mechan.
Feedback
33
Anatomy of a Cache Predictor
Load/store, cache miss
Exec.
Pred. trigger.
Pred. Index.
Pred. Mechan.
Feedback
34
Anatomy of a Cache Predictor
PC, EA, global/local history
Exec.
Pred. trigger.
Pred. Index.
Pred. Mechan.
Feedback
35
Anatomy of a Cache Predictor
Exec.
Pred. trigger.
Pred. Index.
One-level table, two-level tables, associative
buffers, specialized caches
Pred. Mechan.
Feedback
36
Anatomy of a Cache Predictor
Exec.
Pred. trigger.
Pred. Index.
Pred. Mechan.
Feedback
Counters, stride predictors, finite
context Markov pred.
37
Anatomy of a Cache Predictor
Exec.
Pred. trigger.
Pred. Index.
Pred. Mechan.
Feedback
Often imprecise
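
One way to read the predictor anatomy built up over the preceding slides is as a stride prefetch predictor: the trigger is a load, the index is the PC, the mechanism is a stride plus a counter, and the feedback is whether the last stride repeated. The sketch below is an illustration under those assumptions; the class name, table size, and threshold are hypothetical, and the table is untagged, so distinct PCs can alias.

```python
class StridePredictor:
    """PC-indexed stride predictor with a saturating confidence counter."""

    def __init__(self, entries=64, threshold=2):
        self.entries = entries          # number of table entries (assumed)
        self.threshold = threshold      # confidence needed before predicting
        self.table = {}                 # index -> [last_addr, stride, confidence]

    def access(self, pc, addr):
        """Observe a load at (pc, addr); return a predicted next address or None."""
        idx = pc % self.entries         # predictor index: low-order PC bits
        entry = self.table.get(idx)
        if entry is None:
            self.table[idx] = [addr, 0, 0]
            return None
        last_addr, stride, conf = entry
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)     # feedback: stride confirmed
        else:
            conf = 0                    # feedback: stride broken, start over
            stride = new_stride
        self.table[idx] = [addr, stride, conf]
        return addr + stride if conf >= self.threshold else None

p = StridePredictor()
print([p.access(0x400, a) for a in (100, 108, 116, 124, 132)])
# [None, None, None, 132, 140]
```

The predictor stays silent until the same stride has been seen on two consecutive confirmations, trading coverage for fewer useless prefetches.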
38
Applying the Model

Modified LRU policies for L2 caches
Identify a cache parameter
L2 cache miss rate

39
Applying the Model

Modified LRU policies for L2 caches
Identify a cache parameter
Assess potential margin of improvement
OPT vs. LRU
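
The margin of improvement for a replacement policy can be estimated off-line by running the same trace through LRU and through Belady's optimal policy, which evicts the block whose next use lies furthest in the future. A minimal sketch, assuming a fully associative cache and a short hypothetical reference trace (a real study would use set-associative L2 geometry and long traces):

```python
def lru_misses(trace, capacity):
    """Miss count of a fully associative LRU cache holding `capacity` blocks."""
    cache = []                          # least recently used block at index 0
    misses = 0
    for block in trace:
        if block in cache:
            cache.remove(block)         # hit: will be re-appended as MRU
        else:
            misses += 1
            if len(cache) == capacity:
                cache.pop(0)            # evict the LRU block
        cache.append(block)
    return misses

def opt_misses(trace, capacity):
    """Miss count of Belady's off-line optimal policy on the same cache."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]

            def next_use(b):
                # Distance to the next reference, infinite if never reused.
                return future.index(b) if b in future else float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]   # hypothetical block trace
print(lru_misses(trace, 3), opt_misses(trace, 3))  # 10 7
```

On this trace the gap between LRU and OPT is three misses, which is the most any on-line modification of LRU could hope to recover.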

40
Applying the Model

Modified LRU policies for L2 caches
Identify a cache parameter
Assess potential margin of improvement
Propose a design
On-line detection of lines exhibiting temporal
locality

41
Propose a Design
L1 cache miss
EA, PC
Exec.
Event selec.
Pred. Index.
Metadata in L2, Locality Table
Pred. Mechan.
Feedback
LRU stack, locality bit
42
Applying the Model

Modified LRU policies for L2 caches
Identify a cache parameter
Assess potential margin of improvement
Propose a design
Measure efficiency of the scheme
How much of the margin of improvement was reduced
(i.e., compare with OPT and LRU)
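
One plausible way to express "how much of the margin was reduced" is the fraction of the LRU-to-OPT miss gap that the new scheme recovers. The sketch below assumes miss counts measured on the same trace and cache geometry; the function name and the sample counts are hypothetical, not results from the talk.

```python
def margin_closed(lru, opt, scheme):
    """Fraction of the LRU-to-OPT miss gap recovered by a scheme (1.0 = matches OPT)."""
    margin = lru - opt
    if margin <= 0:
        raise ValueError("no margin of improvement: LRU is already as good as OPT")
    return (lru - scheme) / margin

# Hypothetical miss counts on one trace: LRU 1000, OPT 700, modified LRU 850.
print(margin_closed(1000, 700, 850))  # 0.5
```

A negative result means the scheme does worse than plain LRU on that trace.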

43
Conclusion

Do we need more?
We need substantive research on the design of
memory hierarchies that reduce or hide access
latencies while they deliver the memory
bandwidths required by current and future
applications (PITAC Report, Feb. 1999)

44
Possible important areas of research