Improving Cache Performance
AMAT = Hit time + (Miss rate × Miss penalty)
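As a worked illustration (the numbers below are made up, not from the slides), a 1-cycle hit time, 2% miss rate and 100-cycle miss penalty give AMAT = 1 + 0.02 × 100 = 3 cycles. A minimal C sketch of the calculation:

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty; illustrative values only. */
int main(void) {
    double hit_time = 1.0;        /* cycles for a hit */
    double miss_rate = 0.02;      /* 2% of accesses miss */
    double miss_penalty = 100.0;  /* cycles to fetch from the next level */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);  /* prints 3.00 */
    return 0;
}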
Four categories of optimisation
Reduce miss rate
Reduce miss penalty
Reduce miss rate or miss penalty using parallelism
Reduce hit time
5.5. Reducing Miss Rate
Three sources of misses
Compulsory
cold start misses
Capacity
Cache is full
Conflict
Set is full/block is occupied
Increase block size
Increase size of cache
Increase degree of associativity
Larger Block Size
Bigger blocks reduce compulsory misses
Spatial locality
BUT
Increased miss penalty
More data to transfer
Possibly increased overall miss rate
More conflict and capacity misses as there are fewer blocks
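A rough C sketch of the compulsory-miss effect, under the simplifying assumption of a purely sequential scan: about one compulsory miss is taken per block touched, so doubling the block size roughly halves the compulsory misses.

#include <stdio.h>

/* Illustrative only: sequential scan of `bytes` bytes takes roughly one
   compulsory miss per cache block touched. */
unsigned long compulsory_misses(unsigned long bytes, unsigned block_size) {
    return (bytes + block_size - 1) / block_size;
}

int main(void) {
    unsigned long bytes = 1UL << 20;   /* 1 MiB sequential scan */
    for (unsigned bs = 16; bs <= 256; bs *= 2)
        printf("block %3u B -> %lu misses\n", bs, compulsory_misses(bytes, bs));
    return 0;
}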
Effect of Block Size
Larger Caches
Reduces capacity misses
Increases hit time and cost
Higher Associativity
Miss rates improve with higher associativity
Two rules of thumb
8-way set associative caches are almost as effective as fully associative
But much simpler!
2:1 cache rule
A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
Way Prediction
Set-associative cache predicts which block will be needed on next access to the set
Only one tag check is done
If mispredicted the whole set must be checked
E.g. Alpha 21264 instruction cache
Prediction rate > 85%
Correct prediction: 1-cycle hit
Misprediction: 3 cycles
Pseudo-Associative Caches
Check a direct mapped cache for a hit as usual
If it misses, check a second block
Invert MSB of index
One fast and one slow hit time
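A minimal C sketch of the second probe, assuming an illustrative address layout (tag | index | block offset) and a direct-mapped cache with 2^INDEX_BITS blocks: the alternative location is found by flipping the most-significant index bit.

#include <stdint.h>

/* Assumed layout (illustrative): [ tag | index | block offset ] */
#define OFFSET_BITS 5   /* 32-byte blocks  */
#define INDEX_BITS  10  /* 1024 blocks     */

uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

/* On a miss, a pseudo-associative cache probes one alternative block,
   obtained here by inverting the MSB of the index. */
uint32_t alternate_index(uint32_t addr) {
    return cache_index(addr) ^ (1u << (INDEX_BITS - 1));
}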
Compiler Optimisations
Compilers can optimise code to minimise miss rates
Reordering procedures
Aligning basic blocks with cache blocks
Reorganising array element accesses
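One such reorganisation, sketched in C, is loop interchange: the loops are reordered so the inner loop walks the array in the order it is laid out in memory. Array size and code are illustrative.

#define N 512

/* Before: inner loop strides through memory N doubles at a time,
   so nearly every access touches a new cache block. */
void scale_strided(double a[N][N], double k) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] *= k;
}

/* After loop interchange: inner loop walks consecutive elements,
   so each fetched cache block is fully used (spatial locality). */
void scale_sequential(double a[N][N], double k) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= k;
}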
5.6. Reduce Miss Rate or Miss Penalty via Parallelism
Three techniques that overlap instruction execution with memory access
Nonblocking Caches
Dynamic scheduling allows CPU to continue with other instructions while waiting for data
Nonblocking cache allows other cache accesses to continue while waiting for data
Hardware Prefetching
Fetch data/instructions before they are requested by the processor
Either into cache or another buffer
Particularly useful for instructions
High degree of spatial locality
UltraSPARC III
Special prefetch cache for data
Increases effectiveness by about four times
Compiler Prefetching
Compiler inserts prefetch instructions
Two types
Prefetch register value
Prefetch data cache block
Can be faulting or non-faulting
Cache continues as normal while data is prefetched
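As a hedged illustration, written with the GCC/Clang __builtin_prefetch intrinsic rather than the slides' SPARC instructions: a cache-block prefetch is issued a few iterations ahead of the use, and the prefetch distance is a tuning parameter chosen here arbitrarily.

/* Illustrative compiler/programmer-inserted prefetching (non-faulting). */
long sum_with_prefetch(const long *a, long n) {
    const long distance = 16;   /* iterations ahead to prefetch (tuneable) */
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + distance < n)
            __builtin_prefetch(&a[i + distance], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}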
SPARC V9 Prefetch
prefetch [rs1 + rs2], fcn
prefetch [rs1 + imm13], fcn
fcn (prefetch function):
  0  Prefetch for several reads
  1  Prefetch for one read
  2  Prefetch for several writes
  3  Prefetch for one write
  4  Prefetch page
5.7. Reducing Hit Time
Critical
Often affects CPU clock cycle time
Small, Simple Caches
Small usually equals fast in hardware
A small cache may reside on the processor chip
Decreases communication
Compromise: tags on chip, data off-chip
Direct mapped
Data can be read in parallel with tag checking
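A minimal C model of a direct-mapped lookup, with assumed (illustrative) parameters: the index selects exactly one block, so the data can be read out speculatively while the tag comparison decides whether it is a hit.

#include <stdbool.h>
#include <stdint.h>

/* Assumed parameters: 32-byte blocks, 512 blocks (illustrative). */
#define OFFSET_BITS 5
#define INDEX_BITS  9
#define NUM_BLOCKS  (1u << INDEX_BITS)

struct block {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1u << OFFSET_BITS];
};

struct block cache[NUM_BLOCKS];

/* Direct mapped: the index picks a single candidate block, so the byte is
   read in parallel while the tag check confirms the hit. */
bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    struct block *b = &cache[index];

    *out = b->data[addr & ((1u << OFFSET_BITS) - 1)];  /* speculative data read */
    return b->valid && b->tag == tag;                  /* tag check */
}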
Avoiding Address Translation
Physical caches
Use physical addresses
Address translation must happen before cache lookup
Virtual caches
Use virtual addresses
Protection issues
High context switching overhead
Virtual Caches
Minimising context switch overhead
Add process-identifier tag to cache
Multiple virtual addresses may refer to a single physical address
Hardware enforces anti-aliasing
Software requires the less-significant address bits of aliases to be the same (page colouring)
Avoiding Address Translation (cont.)
Choice of page size
Page offset must cover the cache index and block offset bits
Address translation and tag lookup can happen in parallel
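A small C check of that condition, using assumed parameters: if the cache index and block offset fit inside the page offset, the cache can be indexed with untranslated bits while the TLB produces the physical tag.

#include <stdio.h>

int main(void) {
    unsigned page_bits   = 13;  /* 8 KB pages (illustrative) */
    unsigned offset_bits = 5;   /* 32-byte cache blocks      */
    unsigned index_bits  = 8;   /* 256 sets                  */

    /* Index + offset bits are identical in virtual and physical addresses
       as long as they lie within the page offset, so cache indexing and
       TLB lookup can proceed in parallel. */
    if (index_bits + offset_bits <= page_bits)
        printf("Cache can be indexed in parallel with translation\n");
    else
        printf("Cache index depends on translated bits\n");
    return 0;
}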
Pipelining Cache Access
Split cache access into several stages
Impacts on branch and load delays
Trace Caches
Blocks follow program flow rather than spatial locality!
Branch prediction is taken into account by cache
Intel NetBurst microarchitecture
Complicates address mapping
Minimises wasted space within blocks
Cache Optimisation Summary
Cache optimisation is very complex
Improving one factor may have a negative impact on another
Main Memory
Latency and bandwidth are both important
Latency is composed of two factors
Access time
Cycle time
Two main technologies
DRAM
SRAM
Virtual Memory
Physical memory is divided into blocks
Allocated to processes
Provides protection
Allows swapping to disk
Simplifies loading
Historically
Overlays
Programmer controlled swapping
Terminology
Block
Page
Segment
Miss
Page fault
Address fault
Memory mapping (address translation)
Virtual address → physical address
Characteristics
Block size: 4 kB to 64 kB
Hit time: 50 to 150 cycles
Miss penalty: 1,000,000 to 10,000,000 cycles
Miss rate: 0.00001 to 0.001
Categorising VM Systems
Fixed block size
Pages
Variable block size
Segments
Difficult replacement
Hybrid approaches
Paged segments
Multiple page sizes (2^n × the smallest)
Q1: Block placement?
Anywhere in memory
Fully associative
Minimises miss rate
Q2: Block identification?
Page/segment number gives the physical page address
Paging: the offset is concatenated
Segmentation: the offset is added
Uses a page table
One entry per page in the virtual address space
To save space: an inverted page table
One entry per page of physical memory
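A hedged C sketch of block identification with a simple (non-inverted) page table, assuming 8 KB pages and a 32-bit virtual address: the virtual page number indexes the table and the page offset is concatenated onto the physical frame number.

#include <stdint.h>

#define PAGE_BITS  13                  /* 8 KB pages (illustrative)    */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define NUM_VPAGES (1u << 19)          /* 32-bit virtual address space */

struct pte {
    uint32_t valid : 1;
    uint32_t frame : 19;               /* physical page frame number   */
};

struct pte page_table[NUM_VPAGES];     /* one entry per virtual page   */

/* Paging: physical address = frame number concatenated with page offset. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    struct pte e = page_table[vpn];
    /* A real system would raise a page fault when e.valid is 0. */
    return ((uint32_t)e.frame << PAGE_BITS) | offset;
}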
Q3: Block replacement?
Least-recently used (LRU)
Minimises miss rate
Hardware provides a use bit or reference bit
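True LRU is impractical across thousands of page frames, so operating systems approximate it. One common approximation (not named on the slide) is the clock, or second-chance, algorithm, sketched in C using the hardware reference bit:

#include <stdbool.h>

#define NUM_FRAMES 1024                /* illustrative */

bool referenced[NUM_FRAMES];           /* hardware-set use/reference bit */
static int hand = 0;                   /* clock hand */

/* Skip (and clear) recently used frames; evict the first frame whose
   reference bit is already clear. Cheap approximation of LRU. */
int choose_victim(void) {
    for (;;) {
        if (!referenced[hand]) {
            int victim = hand;
            hand = (hand + 1) % NUM_FRAMES;
            return victim;
        }
        referenced[hand] = false;      /* give it a second chance */
        hand = (hand + 1) % NUM_FRAMES;
    }
}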
Q4: Write strategy?
Write back
With a dirty bit
You won't become famous by being the first to try write-through!
Fast Address Translation
Page tables are big
Stored in memory themselves
Two memory accesses for every datum!
Principle of locality
Cache recent translations
Translation look-aside buffer (TLB), or translation buffer (TB)
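A minimal C model of a TLB in front of the page table, assuming a small fully associative buffer (size and organisation are illustrative): recent translations hit in the buffer and avoid the extra memory access.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64   /* illustrative size */

struct tlb_entry {
    bool     valid;
    uint32_t vpn;        /* virtual page number   */
    uint32_t frame;      /* physical frame number */
};

struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: compare the VPN against every entry.
   A hit skips the walk of the in-memory page table. */
bool tlb_lookup(uint32_t vpn, uint32_t *frame) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;
            return true;
        }
    }
    return false;        /* TLB miss: fall back to the page table */
}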
Alpha 21264 TLB
Selecting a Page Size
Big
Smaller page table
Allows parallel cache access
Efficient disk transfers
Reduces TLB misses
Small
Less memory wastage (internal fragmentation)
Quicker process startup
Putting it ALL Together!
SPARC Revisited
Two SPARCs
SuperSPARC
1992
32-bit superscalar design
UltraSPARC
Late 1990s
64-bit design
Graphics support (VIS)
UltraSPARC
Four-way superscalar execution
Two integer ALUs
FP unit
Five functional units
Graphics unit
Pipeline
9 stages
Fetch
Decode
Grouping
Execution
Cache access
Load miss
Integer pipe wait (for FP/graphics pipelines)
Trap resolution
Writeback
Branch Handling
Dynamic branch prediction
Two-bit scheme
Every second instruction in cache has prediction bits (predicts up to 2048 branches)
88% success rate (integer)
Target prediction
Fetches from predicted path
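A minimal C sketch of a two-bit saturating-counter predictor (the table size matches the slide's 2048 branches, but the indexing hash is an assumption): a branch must mispredict twice before its prediction flips.

#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 2048   /* up to 2048 branches tracked */

/* 2-bit saturating counters: 0,1 predict not taken; 2,3 predict taken. */
static uint8_t counters[PRED_ENTRIES];

static unsigned pred_index(uint32_t pc) {
    return (pc >> 2) & (PRED_ENTRIES - 1);   /* simple hash on the PC */
}

bool predict_taken(uint32_t pc) {
    return counters[pred_index(pc)] >= 2;
}

/* Update after the branch resolves: the counter saturates at 0 and 3, so one
   anomalous outcome does not flip a strongly biased prediction. */
void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[pred_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}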
FPU
Five functional units
Add
Multiply
Divide/square root
Two graphics units (add and multiply)
Mostly fully pipelined (latency 3 cycles)
Except divide and square root (not pipelined, latency is 22 cycles for 64-bit)