Improving Cache Performance
AMAT = Hit time + (Miss rate × Miss penalty)
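As a worked illustration (the numbers below are made up, not from the slides), a 1-cycle hit time, 2% miss rate and 100-cycle miss penalty give AMAT = 1 + 0.02 × 100 = 3 cycles. A minimal C sketch of the calculation:

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty; illustrative values only. */
int main(void) {
    double hit_time = 1.0;        /* cycles for a hit */
    double miss_rate = 0.02;      /* 2% of accesses miss */
    double miss_penalty = 100.0;  /* cycles to fetch from the next level */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);  /* prints 3.00 */
    return 0;
}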
Four categories of optimisation
Reduce miss rate
Reduce miss penalty
Reduce miss rate or miss penalty using parallelism
Reduce hit time
5.5. Reducing Miss Rate
Three sources of misses
Compulsory
cold start misses
Capacity
Cache is full
Conflict
Set is full/block is occupied
Increase block size
Increase size of cache
Increase degree of associativity
Larger Block Size
Bigger blocks reduce compulsory misses
Spatial locality
BUT
Increased miss penalty
More data to transfer
Possibly increased overall miss rate
More conflict and capacity misses as there are fewer blocks
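A rough C sketch of the compulsory-miss effect, under the simplifying assumption of a purely sequential scan: about one compulsory miss is taken per block touched, so doubling the block size roughly halves the compulsory misses.

#include <stdio.h>

/* Illustrative only: sequential scan of `bytes` bytes takes roughly one
   compulsory miss per cache block touched. */
unsigned long compulsory_misses(unsigned long bytes, unsigned block_size) {
    return (bytes + block_size - 1) / block_size;
}

int main(void) {
    unsigned long bytes = 1UL << 20;   /* 1 MiB sequential scan */
    for (unsigned bs = 16; bs <= 256; bs *= 2)
        printf("block %3u B -> %lu misses\n", bs, compulsory_misses(bytes, bs));
    return 0;
}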
Effect of Block Size
Larger Caches
Reduces capacity misses
Increases hit time and cost
Higher Associativity
Miss rates improve with higher associativity
Two rules of thumb
8-way set associative caches are almost as effective as fully associative
But much simpler!
2:1 cache rule
A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
Way Prediction
Set-associative cache predicts which block will be needed on next access to the set
Only one tag check is done
If mispredicted the whole set must be checked
E.g. Alpha 21264 instruction cache
Prediction rate > 85%
Correct prediction: 1-cycle hit
Misprediction: 3 cycles
Pseudo-Associative Caches
Check a direct mapped cache for a hit as usual
If it misses, check a second block
Invert MSB of index
One fast and one slow hit time
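A minimal C sketch of the second probe, assuming an illustrative address layout (tag | index | block offset) and a direct-mapped cache with 2^INDEX_BITS blocks: the alternative location is found by flipping the most-significant index bit.

#include <stdint.h>

/* Assumed layout (illustrative): [ tag | index | block offset ] */
#define OFFSET_BITS 5   /* 32-byte blocks  */
#define INDEX_BITS  10  /* 1024 blocks     */

uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

/* On a miss, a pseudo-associative cache probes one alternative block,
   obtained here by inverting the MSB of the index. */
uint32_t alternate_index(uint32_t addr) {
    return cache_index(addr) ^ (1u << (INDEX_BITS - 1));
}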
Compiler Optimisations
Compilers can optimise code to minimise miss rates
Reordering procedures
Aligning basic blocks with cache blocks
Reorganising array element accesses
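One such reorganisation, sketched in C, is loop interchange: the loops are reordered so the inner loop walks the array in the order it is laid out in memory. Array size and code are illustrative.

#define N 512

/* Before: inner loop strides through memory N doubles at a time,
   so nearly every access touches a new cache block. */
void scale_strided(double a[N][N], double k) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] *= k;
}

/* After loop interchange: inner loop walks consecutive elements,
   so each fetched cache block is fully used (spatial locality). */
void scale_sequential(double a[N][N], double k) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= k;
}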
5.6. Reduce Miss Rate or Miss Penalty via Parallelism
Three techniques that overlap instruction execution with memory access
Nonblocking Caches
Dynamic scheduling allows CPU to continue with other instructions while waiting for data
Nonblocking cache allows other cache accesses to continue while waiting for data
Hardware Prefetching
Fetch data/instructions before they are requested by the processor
Either into cache or another buffer
Particularly useful for instructions
High degree of spatial locality
UltraSPARC III
Special prefetch cache for data
Increases effectiveness by about four times
Compiler Prefetching
Compiler inserts prefetch instructions
Two types
Prefetch register value
Prefetch data cache block
Can be faulting or non-faulting
Cache continues as normal while data is prefetched
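As a hedged illustration, written with the GCC/Clang __builtin_prefetch intrinsic rather than the slides' SPARC instructions: a cache-block prefetch is issued a few iterations ahead of the use, and the prefetch distance is a tuning parameter chosen here arbitrarily.

/* Illustrative compiler/programmer-inserted prefetching (non-faulting). */
long sum_with_prefetch(const long *a, long n) {
    const long distance = 16;   /* iterations ahead to prefetch (tuneable) */
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + distance < n)
            __builtin_prefetch(&a[i + distance], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}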
SPARC V9 Prefetch
prefetch [rs1 + rs2], fcn
prefetch [rs1 + imm13], fcn
fcn (prefetch function):
  0  Prefetch for several reads
  1  Prefetch for one read
  2  Prefetch for several writes
  3  Prefetch for one write
  4  Prefetch page
5.7. Reducing Hit Time
Critical
Often affects CPU clock cycle time
Small, Simple Caches
Small usually equals fast in hardware
A small cache may reside on the processor chip
Decreases communication
Compromise: tags on chip, data off-chip
Direct mapped
Data can be read in parallel with tag checking
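A minimal C model of a direct-mapped lookup, with assumed (illustrative) parameters: the index selects exactly one block, so the data can be read out speculatively while the tag comparison decides whether it is a hit.

#include <stdbool.h>
#include <stdint.h>

/* Assumed parameters: 32-byte blocks, 512 blocks (illustrative). */
#define OFFSET_BITS 5
#define INDEX_BITS  9
#define NUM_BLOCKS  (1u << INDEX_BITS)

struct block {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1u << OFFSET_BITS];
};

struct block cache[NUM_BLOCKS];

/* Direct mapped: the index picks a single candidate block, so the byte is
   read in parallel while the tag check confirms the hit. */
bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    struct block *b = &cache[index];

    *out = b->data[addr & ((1u << OFFSET_BITS) - 1)];  /* speculative data read */
    return b->valid && b->tag == tag;                  /* tag check */
}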
Avoiding Address Translation
Physical caches
Use physical addresses
Address translation must happen before cache lookup
Virtual caches
Use virtual addresses
Protection issues
High context switching overhead
Virtual Caches
Minimising context switch overhead
Add process-identifier tag to cache
Multiple virtual addresses may refer to a single physical address
Hardware enforces anti-aliasing
Software requires the less-significant address bits of aliases to be the same (page colouring)
Avoiding Address Translation (cont.)
Choice of page size
Page offset must cover the cache index and block offset bits
Address translation and tag lookup can happen in parallel
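A small C check of that condition, using assumed parameters: if the cache index and block offset fit inside the page offset, the cache can be indexed with untranslated bits while the TLB produces the physical tag.

#include <stdio.h>

int main(void) {
    unsigned page_bits   = 13;  /* 8 KB pages (illustrative) */
    unsigned offset_bits = 5;   /* 32-byte cache blocks      */
    unsigned index_bits  = 8;   /* 256 sets                  */

    /* Index + offset bits are identical in virtual and physical addresses
       as long as they lie within the page offset, so cache indexing and
       TLB lookup can proceed in parallel. */
    if (index_bits + offset_bits <= page_bits)
        printf("Cache can be indexed in parallel with translation\n");
    else
        printf("Cache index depends on translated bits\n");
    return 0;
}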
Pipelining Cache Access
Split cache access into several stages
Impacts on branch and load delays
Trace Caches
Blocks follow program flow rather than spatial locality!
Branch prediction is taken into account by cache
Intel NetBurst microarchitecture
Complicates address mapping
Minimises wasted space within blocks
Cache Optimisation Summary
Cache optimisation is very complex
Improving one factor may have a negative impact on another
Main Memory
Latency and bandwidth are both important
Latency is composed of two factors
Access time
Cycle time
Two main technologies
DRAM
SRAM
Virtual Memory
Physical memory is divided into blocks
Allocated to processes
Provides protection
Allows swapping to disk
Simplifies loading
Historically
Overlays
Programmer controlled swapping
Terminology
Block
Page
Segment
Miss
Page fault
Address fault
Memory mapping (address translation)
Virtual address → physical address
Characteristics
Block size: 4 kB to 64 kB
Hit time: 50 to 150 cycles
Miss penalty: 1,000,000 to 10,000,000 cycles
Miss rate: 0.00001 to 0.001
Categorising VM Systems
Fixed block size
Pages
Variable block size
Segments
Difficult replacement
Hybrid approaches
Paged segments
Multiple page sizes (2^n × the smallest)
Q1: Block placement?
Anywhere in memory
Fully associative
Minimises miss rate
Q2: Block identification?
Page/segment number gives the physical page address
Paging: the offset is concatenated
Segmentation: the offset is added
Uses a page table
One entry per page in the virtual address space
To save space: an inverted page table
One entry per page of physical memory
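A hedged C sketch of block identification with a simple (non-inverted) page table, assuming 8 KB pages and a 32-bit virtual address: the virtual page number indexes the table and the page offset is concatenated onto the physical frame number.

#include <stdint.h>

#define PAGE_BITS  13                  /* 8 KB pages (illustrative)    */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define NUM_VPAGES (1u << 19)          /* 32-bit virtual address space */

struct pte {
    uint32_t valid : 1;
    uint32_t frame : 19;               /* physical page frame number   */
};

struct pte page_table[NUM_VPAGES];     /* one entry per virtual page   */

/* Paging: physical address = frame number concatenated with page offset. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    struct pte e = page_table[vpn];
    /* A real system would raise a page fault when e.valid is 0. */
    return ((uint32_t)e.frame << PAGE_BITS) | offset;
}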
Q3: Block replacement?
Least-recently used (LRU)
Minimises miss rate
Hardware provides a use bit or reference bit
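True LRU is impractical across thousands of page frames, so operating systems approximate it. One common approximation (not named on the slide) is the clock, or second-chance, algorithm, sketched in C using the hardware reference bit:

#include <stdbool.h>

#define NUM_FRAMES 1024                /* illustrative */

bool referenced[NUM_FRAMES];           /* hardware-set use/reference bit */
static int hand = 0;                   /* clock hand */

/* Skip (and clear) recently used frames; evict the first frame whose
   reference bit is already clear. Cheap approximation of LRU. */
int choose_victim(void) {
    for (;;) {
        if (!referenced[hand]) {
            int victim = hand;
            hand = (hand + 1) % NUM_FRAMES;
            return victim;
        }
        referenced[hand] = false;      /* give it a second chance */
        hand = (hand + 1) % NUM_FRAMES;
    }
}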
Q4: Write strategy?
Write back
With a dirty bit
You won't become famous by being the first to try write-through!
Fast Address Translation
Page tables are big
Stored in memory themselves
Two memory accesses for every datum!
Principle of locality
Cache recent translations
Translation look-aside buffer (TLB), or translation buffer (TB)
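A minimal C model of a TLB in front of the page table, assuming a small fully associative buffer (size and organisation are illustrative): recent translations hit in the buffer and avoid the extra memory access.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64   /* illustrative size */

struct tlb_entry {
    bool     valid;
    uint32_t vpn;        /* virtual page number   */
    uint32_t frame;      /* physical frame number */
};

struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: compare the VPN against every entry.
   A hit skips the walk of the in-memory page table. */
bool tlb_lookup(uint32_t vpn, uint32_t *frame) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;
            return true;
        }
    }
    return false;        /* TLB miss: fall back to the page table */
}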
Alpha 21264 TLB
Selecting a Page Size
Big
Smaller page table
Allows parallel cache access
Efficient disk transfers
Reduces TLB misses
Small
Less memory wastage (internal fragmentation)
Quicker process startup
Putting it ALL Together!
SPARC Revisited
Two SPARCs
SuperSPARC
1992
32-bit superscalar design
UltraSPARC
Late 1990s
64-bit design
Graphics support (VIS)
UltraSPARC
Four-way superscalar execution
Two integer ALUs
FP unit
Five functional units
Graphics unit
Pipeline
9 stages
Fetch
Decode
Grouping
Execution
Cache access
Load miss
Integer pipe wait (for FP/graphics pipelines)
Trap resolution
Writeback
Branch Handling
Dynamic branch prediction
Two-bit scheme
Every second instruction in cache has prediction bits (predicts up to 2048 branches)
88% success rate (integer)
Target prediction
Fetches from predicted path
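A minimal C sketch of a two-bit saturating-counter predictor (the table size matches the slide's 2048 branches, but the indexing hash is an assumption): a branch must mispredict twice before its prediction flips.

#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 2048   /* up to 2048 branches tracked */

/* 2-bit saturating counters: 0,1 predict not taken; 2,3 predict taken. */
static uint8_t counters[PRED_ENTRIES];

static unsigned pred_index(uint32_t pc) {
    return (pc >> 2) & (PRED_ENTRIES - 1);   /* simple hash on the PC */
}

bool predict_taken(uint32_t pc) {
    return counters[pred_index(pc)] >= 2;
}

/* Update after the branch resolves: the counter saturates at 0 and 3, so one
   anomalous outcome does not flip a strongly biased prediction. */
void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[pred_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}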
FPU
Five functional units
Add
Multiply
Divide/square root
Two graphics units (add and multiply)
Mostly fully pipelined (latency 3 cycles)
Except divide and square root (not pipelined, latency is 22 cycles for 64-bit)