Title: CS252 Graduate Computer Architecture, Lecture 16: Cache Optimizations (Cont), Memory Technology
1. CS252 Graduate Computer Architecture
Lecture 16: Cache Optimizations (Cont), Memory Technology
- John Kubiatowicz
- Electrical Engineering and Computer Sciences
- University of California, Berkeley
- http://www.eecs.berkeley.edu/~kubitron/cs252
- http://www-inst.eecs.berkeley.edu/~cs252
2. Review: Cache Performance
- Miss-oriented approach to memory access:
  CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
- Separating out the memory component entirely:
  CPU time = IC x (ALU ops per instruction x CPI_ALUops + Memory accesses per instruction x AMAT) x Clock cycle time
- AMAT = Average Memory Access Time = Hit time + Miss rate x Miss penalty
3. Review: 6 Basic Cache Optimizations
- Reducing hit time
  - 1. Avoiding Address Translation during Cache Indexing
    - E.g., overlap TLB and cache access, Virtually Addressed Caches
- Reducing Miss Penalty
  - 2. Giving Reads Priority over Writes
    - E.g., read completes before earlier writes still sitting in the write buffer
  - 3. Multilevel Caches
- Reducing Miss Rate
  - 4. Larger Block size (Compulsory misses)
  - 5. Larger Cache size (Capacity misses)
  - 6. Higher Associativity (Conflict misses)
4. 1: Fast Hits by Avoiding Address Translation
- Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. a Physical Cache
- Every time a process is switched, the cache logically must be flushed; otherwise you get false hits
  - Cost is the time to flush plus compulsory misses from the empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O must interact with the cache, so it needs the virtual address
- Solution to aliases
  - HW guarantee: if the cache index is covered by the page offset and the cache is direct mapped, aliases map to the same block and so must be unique; the software version is called page coloring
- Solution to cache flush
  - Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if the wrong process
5. Two Options for Avoiding Translation
- [Figure: block diagrams (CPU, TLB "TB", cache, L2, MEM, with VA/PA marked at each interface) comparing the conventional physically addressed (indexed) organization against two variations. One variation overlaps cache access with VA translation by using PA tags while remaining physically indexed, which requires the index to stay invariant across translation.]
6. 3: Multi-Level Caches
- L2 equations:
  - AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  - Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  - AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
- Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
  - Global miss rate is what matters (a numeric sketch follows below)
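Since the slide leaves the arithmetic implicit, here is a minimal C sketch that plugs hypothetical L1/L2 parameters (illustrative values, not from the slide) into the equations above and also reports the global L2 miss rate:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical parameters, chosen only for illustration. */
        double hit_time_L1  = 1.0;    /* cycles                      */
        double miss_rate_L1 = 0.05;   /* local = global for L1       */
        double hit_time_L2  = 10.0;   /* cycles                      */
        double miss_rate_L2 = 0.40;   /* local miss rate of L2       */
        double miss_pen_L2  = 100.0;  /* cycles to main memory       */

        /* Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2) */
        double miss_pen_L1 = hit_time_L2 + miss_rate_L2 * miss_pen_L2;

        /* AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1) */
        double amat = hit_time_L1 + miss_rate_L1 * miss_pen_L1;

        /* Global L2 miss rate = Miss Rate(L1) x Miss Rate(L2) */
        double global_L2 = miss_rate_L1 * miss_rate_L2;

        printf("Miss Penalty(L1)    = %.1f cycles\n", miss_pen_L1);
        printf("AMAT                = %.1f cycles\n", amat);
        printf("Global L2 miss rate = %.3f\n", global_L2);
        return 0;
    }

With these numbers, AMAT = 1 + 0.05 x (10 + 0.4 x 100) = 3.5 cycles, and the global L2 miss rate is 2% even though the local L2 miss rate is 40%.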
7. Review (Cont): 12 Advanced Cache Optimizations
- Reducing hit time
- Small and simple caches
- Way prediction
- Trace caches
- Increasing cache bandwidth
- Pipelined caches
- Multibanked caches
- Nonblocking caches
- Reducing Miss Penalty
- Critical word first
- Merging write buffers
- Reducing Miss Rate
- Victim Cache
- Hardware prefetching
- Compiler prefetching
- Compiler Optimizations
8. 4: Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to maintain bandwidth, but at higher latency
- Instruction cache access pipeline stages:
  - 1: Pentium
  - 2: Pentium Pro through Pentium III
  - 4: Pentium 4
- => greater penalty on mispredicted branches
- => more clock cycles between the issue of the load and the use of the data
9. 5: Increasing Cache Bandwidth: Non-Blocking Caches
- A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  - requires F/E bits on registers or out-of-order execution
  - requires multi-bank memories
- "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses (see the MSHR sketch below)
  - Requires multiple memory banks (otherwise this cannot be supported)
  - Pentium Pro allows 4 outstanding memory misses
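The slide does not say how the controller tracks outstanding misses; one common structure (not named on the slide) is a set of miss status holding registers (MSHRs). A simplified C sketch follows; the field names are invented, and secondary misses to an already-pending block are not handled:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHR 4   /* e.g., Pentium Pro allowed 4 outstanding misses */

    /* One miss status holding register: one outstanding miss to a block. */
    typedef struct {
        bool     valid;       /* entry in use                              */
        uint64_t block_addr;  /* block being fetched from memory           */
        int      dest_reg;    /* register waiting on the data (simplified) */
    } mshr_t;

    static mshr_t mshr[NUM_MSHR];

    /* On a cache miss: allocate an MSHR if one is free, else stall.
     * Returns the index of the allocated entry, or -1 if all are busy. */
    int allocate_mshr(uint64_t block_addr, int dest_reg) {
        for (int i = 0; i < NUM_MSHR; i++) {
            if (!mshr[i].valid) {
                mshr[i].valid = true;
                mshr[i].block_addr = block_addr;
                mshr[i].dest_reg = dest_reg;
                return i;          /* cache can keep servicing hits */
            }
        }
        return -1;                 /* no free MSHR: structural stall */
    }

    /* When memory returns a block, free the matching entry. */
    void complete_miss(uint64_t block_addr) {
        for (int i = 0; i < NUM_MSHR; i++) {
            if (mshr[i].valid && mshr[i].block_addr == block_addr)
                mshr[i].valid = false;
        }
    }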
10. Value of Hit Under Miss for SPEC (old data)
- [Figure: AMAT for "hit under n misses" with n = 0->1, 1->2, 2->64, and the base case]
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss, SPEC 92
11. 6: Increasing Cache Bandwidth via Multiple Banks
- Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  - E.g., the T1 (Niagara) L2 has 4 banks
- Banking works best when accesses naturally spread themselves across banks => the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving:
  - Spread block addresses sequentially across banks
  - E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on (a small sketch of this mapping follows below)
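A minimal C sketch of the sequential-interleaving mapping just described; the 4-bank figure and the block-address granularity are assumptions taken from the example above, not a description of any specific machine:

    #include <stdint.h>

    #define NUM_BANKS 4   /* e.g., the T1 (Niagara) L2 has 4 banks */

    /* Sequential interleaving: spread block addresses across banks. */
    static inline unsigned bank_of(uint64_t block_addr) {
        return (unsigned)(block_addr % NUM_BANKS);   /* bank = address mod 4 */
    }

    static inline uint64_t index_within_bank(uint64_t block_addr) {
        return block_addr / NUM_BANKS;               /* position inside the bank */
    }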
12. 7: Reduce Miss Penalty: Early Restart and Critical Word First
- Don't wait for the full block before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Spatial locality => the CPU tends to want the next sequential word, so the benefit of early restart alone is not clear
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
  - Long blocks are more popular today => Critical Word First is widely used
13. 8: Merging Write Buffer to Reduce Miss Penalty
- The write buffer allows the processor to continue while waiting for the write to memory
- If the buffer contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry
- If so, the new data are combined with that entry (a sketch of this check follows below)
- Increases the effective block size of writes for a write-through cache when writes are to sequential words or bytes, since multiword writes are more efficient to memory
- The Sun T1 (Niagara) processor, among many others, uses write merging
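As a rough illustration of the address check and merge described above, here is a hedged C sketch; the entry count, block size, and field names are invented for the example and do not describe any particular processor's buffer:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define ENTRIES     4   /* write-buffer entries                       */
    #define WORDS_PER_E 4   /* 4 words (16 B) per entry, illustrative     */

    typedef struct {
        bool     valid;                   /* entry holds a pending write   */
        uint64_t block_addr;              /* aligned address of the block  */
        bool     word_valid[WORDS_PER_E]; /* which words have been written */
        uint32_t data[WORDS_PER_E];
    } wb_entry_t;

    static wb_entry_t wb[ENTRIES];

    /* On a store: if an existing valid entry covers the same block,
     * merge the new word into it instead of taking a fresh entry.     */
    bool write_buffer_insert(uint64_t addr, uint32_t value) {
        uint64_t block = addr / (WORDS_PER_E * 4);   /* 4-byte words */
        unsigned word  = (addr / 4) % WORDS_PER_E;

        for (int i = 0; i < ENTRIES; i++) {          /* try to merge */
            if (wb[i].valid && wb[i].block_addr == block) {
                wb[i].data[word] = value;
                wb[i].word_valid[word] = true;
                return true;
            }
        }
        for (int i = 0; i < ENTRIES; i++) {          /* else allocate */
            if (!wb[i].valid) {
                memset(&wb[i], 0, sizeof wb[i]);
                wb[i].valid = true;
                wb[i].block_addr = block;
                wb[i].data[word] = value;
                wb[i].word_valid[word] = true;
                return true;
            }
        }
        return false;   /* buffer full: the processor must stall */
    }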
14. 9: Reducing Misses: a Victim Cache
- How can we combine the fast hit time of a direct-mapped cache yet still avoid conflict misses?
- Add a small buffer that holds data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache
- Used in Alpha, HP machines
- [Figure: a direct-mapped cache (TAGS, DATA) backed by a small fully associative victim cache of four entries, each holding a tag-and-comparator plus one cache line of data, connected to the next lower level in the hierarchy]
15. 10: Reducing Misses by Hardware Prefetching of Instructions & Data
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction prefetching
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
- Data prefetching
  - The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  - Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is small
16. 11: Reducing Misses by Software Prefetching of Data
- Data prefetch
  - Load data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time
  - Is the cost of issuing the prefetches less than the savings in reduced misses?
  - Wider superscalar issue reduces the difficulty of finding issue bandwidth (a prefetch sketch follows below)
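As one concrete example of a non-faulting cache prefetch inserted by the compiler or programmer, the sketch below uses the GCC/Clang builtin __builtin_prefetch; the loop, arrays, and prefetch distance of 16 iterations are illustrative assumptions, not from the slide, and the distance would normally be tuned to the miss latency:

    /* Compile with GCC or Clang; __builtin_prefetch is a compiler builtin,
     * not an instruction defined by the slide's ISAs.                      */
    void scale(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n) {
                __builtin_prefetch(&b[i + 16], 0, 1);  /* will be read     */
                __builtin_prefetch(&a[i + 16], 1, 1);  /* will be written  */
            }
            a[i] = 2.0 * b[i];
        }
    }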
17. 12: Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data
  - Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 separate arrays
  - Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop Fusion: combine 2 independent loops that have the same looping and some variables in common
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
18. Merging Arrays Example
- /* Before: 2 sequential arrays */
- int val[SIZE];
- int key[SIZE];
- /* After: 1 array of structures */
- struct merge {
-   int val;
-   int key;
- };
- struct merge merged_array[SIZE];
- Reduces conflicts between val & key; improves spatial locality
19. Loop Interchange Example
- /* Before */
- for (k = 0; k < 100; k = k+1)
-   for (j = 0; j < 100; j = j+1)
-     for (i = 0; i < 5000; i = i+1)
-       x[i][j] = 2 * x[i][j];
- /* After */
- for (k = 0; k < 100; k = k+1)
-   for (i = 0; i < 5000; i = i+1)
-     for (j = 0; j < 100; j = j+1)
-       x[i][j] = 2 * x[i][j];
- Sequential accesses instead of striding through memory every 100 words; improved spatial locality
20. Loop Fusion Example
- /* Before */
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-     a[i][j] = 1/b[i][j] * c[i][j];
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-     d[i][j] = a[i][j] + c[i][j];
- /* After */
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-   {  a[i][j] = 1/b[i][j] * c[i][j];
-      d[i][j] = a[i][j] + c[i][j]; }
- 2 misses per access to a & c vs. one miss per access; improves locality
21. Blocking Example
- /* Before */
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-   {  r = 0;
-      for (k = 0; k < N; k = k+1)
-        r = r + y[i][k] * z[k][j];
-      x[i][j] = r;
-   };
- Two inner loops:
  - Read all N x N elements of z
  - Read N elements of 1 row of y repeatedly
  - Write N elements of 1 row of x
- Capacity misses are a function of N & cache size:
  - 2N^3 + N^2 words accessed (assuming no conflicts; otherwise more)
- Idea: compute on a B x B submatrix that fits in the cache
22. Blocking Example
- /* After */
- for (jj = 0; jj < N; jj = jj+B)
-   for (kk = 0; kk < N; kk = kk+B)
-     for (i = 0; i < N; i = i+1)
-       for (j = jj; j < min(jj+B, N); j = j+1)
-       {  r = 0;
-          for (k = kk; k < min(kk+B, N); k = k+1)
-            r = r + y[i][k] * z[k][j];
-          x[i][j] = x[i][j] + r;
-       };
- B is called the Blocking Factor
- Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
- Conflict misses, too?
23. Reducing Conflict Misses by Blocking
- Conflict misses occur in caches that are not fully associative and depend on the blocking size
- Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of a factor of 48, even though both fit in the cache
24. Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
25. Compiler Optimization vs. Memory Hierarchy Search
- The compiler tries to figure out memory hierarchy optimizations
- New approach: "auto-tuners" first run variations of the program on a computer to find the best combinations of optimizations (blocking, padding, ...) and algorithms, then produce C code to be compiled for that computer
- Auto-tuners targeted to numerical methods
  - E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (sparse linear algebra), Spiral (DSP), FFTW
26. Sparse Matrix: Search for Blocking (for a finite element problem) [Im, Yelick, Vuduc, 2005]
27. Best Sparse Blocking for 8 Computers
- [Figure: grid of best register block sizes, row block size (r) in {1, 2, 4, 8} vs. column block size (c) in {1, 2, 4, 8}, one point per machine]
- All possible column block sizes were selected across the 8 computers. How could a compiler know?
29. AMD Opteron Memory Hierarchy
- 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz; fastest memory is PC3200 DDR SDRAM
- 48-bit virtual and 40-bit physical addresses
- I and D caches: 64 KB each, 2-way set associative, 64-B blocks, LRU
- L2 cache: 1 MB, 16-way, 64-B blocks, pseudo-LRU
- Data and L2 caches use write back, write allocate
- L1 caches are virtually indexed and physically tagged
- L1 I TLB and L1 D TLB: fully associative, 40 entries
  - 32 entries for 4 KB pages and 8 for 2 MB or 4 MB pages
- L2 I TLB and L2 D TLB: 4-way, 512 entries for 4 KB pages
- Memory controller allows up to 10 cache misses
  - 8 from the D cache and 2 from the I cache
30. Opteron Memory Hierarchy Performance
- For SPEC2000:
  - I cache misses per instruction: 0.01% to 0.09%
  - D cache misses per instruction: 1.34% to 1.43%
  - L2 cache misses per instruction: 0.23% to 0.36%
- Commercial benchmark ("TPC-C-like"):
  - I cache misses per instruction: 1.83% (100X!)
  - D cache misses per instruction: 1.39% (about the same)
  - L2 cache misses per instruction: 0.62% (2X to 3X)
- How does this compare to an ideal CPI of 0.33?
31. CPI Breakdown for Integer Programs
- CPI above base attributable to memory: about 50%
- L2 cache misses: about 25% overall (about 50% of the memory CPI)
- Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound
32. CPI Breakdown for Floating Point Programs
- CPI above base attributable to memory: about 60%
- L2 cache misses: about 40% overall (about 70% of the memory CPI)
- Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound
33. Pentium 4 vs. Opteron Memory Hierarchy
- Clock rates are for this 2005 comparison; faster versions existed
34. Misses Per Instruction: Pentium 4 vs. Opteron
- [Figure: ratio of Pentium 4 to Opteron misses per instruction per benchmark; ratios above 1 (2.3X to 3.4X) mean the Opteron is better, ratios below 1 (0.5X) mean the Pentium is better]
- D cache misses: P4 is 2.3X to 3.4X vs. Opteron
- L2 cache misses: P4 is 0.5X to 1.5X vs. Opteron
- Note: same ISA, but not the same instruction count!
35. Fallacies and Pitfalls
- Not delivering high memory bandwidth in a cache-based system
  - Of the 10 fastest computers on the Stream benchmark [McCalpin 2005], only 4 rely on data caches, and their memory BW per processor is 7X to 25X slower than the NEC SX-7
36. Main Memory Background
- Performance of main memory:
  - Latency: cache miss penalty
    - Access time: time between the request and the word arriving
    - Cycle time: time between requests
  - Bandwidth: I/O & large-block miss penalty (L2)
- Main memory is DRAM: Dynamic Random Access Memory
  - Dynamic since it needs to be refreshed periodically (every 8 ms, roughly 1% of the time)
  - Addresses divided into 2 halves (memory as a 2D matrix):
    - RAS, or Row Address Strobe
    - CAS, or Column Address Strobe
- Caches use SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - Size: DRAM/SRAM is about 4-8; cost and cycle time: SRAM/DRAM is about 8-16
37. Main Memory Deep Background
- "Out-of-Core", "In-Core", "Core Dump"?
- "Core memory"?
- Non-volatile, magnetic
- Lost to 4 Kbit DRAM (today we use 512 Mbit DRAM)
- Access time 750 ns, cycle time 1500-3000 ns
38. Core Memories (1950s & 60s)
- The first magnetic core memory, from the IBM 405 Alphabetical Accounting Machine
- Core memory stored data as magnetization in iron rings
  - Iron "cores" woven into a 2-dimensional mesh of wires
  - Origin of the term "Dump Core"
  - Rumor that IBM consulted the Life Savers company
- See http://www.columbia.edu/acis/history/core.html
39. DRAM Logical Organization (4 Mbit)
- [Figure: address lines feed a 2,048 x 2,048 memory array of storage cells; a word line selects a row, the sense amps / I/O read it out, and the column decoder picks the bit for the data pins D and Q]
- Square root of the bits per RAS/CAS
40. Quest for DRAM Performance
- Fast Page Mode
  - Add timing signals that allow repeated accesses to the row buffer without another row access time
  - Such a buffer comes naturally, as each array buffers 1024 to 2048 bits on each access
- Synchronous DRAM (SDRAM)
  - Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller
- Double Data Rate (DDR SDRAM)
  - Transfer data on both the rising edge and the falling edge of the DRAM clock signal => doubling the peak data rate
  - DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates, up to 400 MHz
  - DDR3 drops to 1.5 volts, with higher clock rates up to 800 MHz
- Improved bandwidth, not latency
41. DRAM name is based on peak chip transfers per second; DIMM name is based on peak DIMM MBytes per second
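- Worked example of the naming rule (using the PC3200 DIMMs mentioned on the Opteron slide): a DDR-400 chip peaks at 400 M transfers/sec, and a 64-bit (8-byte) DIMM built from such chips peaks at 400 M x 8 B = 3200 MB/sec, hence the DIMM name PC3200.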
42. Classical DRAM Organization (square)
- [Figure: a square RAM cell array; a row decoder driven by the row address asserts a word (row) select line, bit (data) lines run through the array, and a column selector / I/O circuit driven by the column address picks the data bit; each intersection of a word line and a bit line is a 1-T DRAM cell]
- Row and column address together select 1 bit at a time
43. Review: 1-T Memory Cell (DRAM)
- [Figure: a single transistor, gated by the row select line, connects the storage capacitor to the bit line]
- Write:
  - 1. Drive the bit line
  - 2. Select the row
- Read:
  - 1. Precharge the bit line to Vdd/2
  - 2. Select the row
  - 3. The cell and bit line share charge
    - Very small voltage change on the bit line
  - 4. Sense (fancy sense amp)
    - Can detect changes of about 1 million electrons
  - 5. Write: restore the value
- Refresh:
  - 1. Just do a dummy read of every cell
44. DRAM Capacitors: more capacitance in a small area
- Trench capacitors:
  - Logic ABOVE the capacitor
  - Gain in surface area of the capacitor
  - Better scaling properties
  - Better planarization
- Stacked capacitors:
  - Logic BELOW the capacitor
  - Gain in surface area of the capacitor
  - 2-dim cross-section quite small
45. DRAM Read Timing
- Every DRAM access begins with the assertion of RAS_L
- 2 ways to read: early or late vs. CAS
- [Timing diagram: a DRAM read cycle showing RAS_L and CAS_L, the address bus multiplexing the row address then the column address, WE_L, OE_L, and the data bus going from high-Z to Data Out; annotated with the read access time, the output-enable delay, and the DRAM read cycle time]
- Early read cycle: OE_L asserted before CAS_L
- Late read cycle: OE_L asserted after CAS_L
46. Four Key DRAM Timing Parameters
- tRAC: minimum time from the RAS line falling to valid data output
  - Quoted as the speed of a DRAM when you buy it, since it is on the purchase sheet
  - A typical 4 Mbit DRAM has tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next
  - tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from the CAS line falling to valid data output
  - 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next
  - 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
47. Main Memory Performance
- [Figure: timeline showing that the access time (request to data) is shorter than the cycle time (request to next request)]
- DRAM (read/write) cycle time is much longer than DRAM (read/write) access time
  - Roughly 2:1; why?
- DRAM (read/write) cycle time:
  - How frequently can you initiate an access?
  - Analogy: a little kid can only ask his father for money on Saturday
- DRAM (read/write) access time:
  - How quickly will you get what you want once you initiate an access?
  - Analogy: as soon as he asks, his father gives him the money
- DRAM bandwidth limitation analogy:
  - What happens if he runs out of money on Wednesday?
48. Increasing Bandwidth: Interleaving
- [Figure: access pattern without interleaving: the CPU must wait for D1 to become available from the single memory before starting the access for D2. Access pattern with 4-way interleaving: accesses to banks 0, 1, 2, 3 are started back-to-back, and by the time bank 3 has been accessed, bank 0 can be accessed again]
49. Main Memory Performance
- Wide: CPU/Mux is 1 word; Mux/cache, bus, and memory are N words (Alpha: 64 bits & 256 bits)
- Interleaved: CPU, cache, and bus are 1 word; memory is N modules (4 modules in the example); the example is word interleaved
- Simple: CPU, cache, bus, and memory are all the same width (32 bits)
50. Main Memory Performance
- Timing model:
  - 1 cycle to send the address
  - 4 cycles access time, 10 cycles cycle time, 1 cycle to send data
  - Cache block is 4 words
- Simple M.P. = 4 x (1 + 10 + 1) = 48
- Wide M.P. = 1 + 10 + 1 = 12
- Interleaved M.P. = 1 + 10 + 4 x 1 = 15
51. Avoiding Bank Conflicts
- Lots of banks
- int x[256][512];
- for (j = 0; j < 512; j = j+1)
-   for (i = 0; i < 256; i = i+1)
-     x[i][j] = 2 * x[i][j];
- Even with 128 banks, since 512 is a multiple of 128, the word accesses conflict on the same bank
- SW: loop interchange, or declaring the array dimension to not be a power of 2 ("array padding")
- HW: prime number of banks
  - bank number = address mod number of banks
  - address within bank = address / number of words in bank
  - a modulo & divide per memory access with a prime number of banks?
52. Finding the Bank Number and Address within a Bank
- Problem: we want to determine the number of banks, Nb, to use and the number of words to store in each bank, Wb, such that:
  - given a word address x, it is easy to find the bank where x will be found, B(x), and the address of x within the bank, A(x)
  - for any address x, B(x) and A(x) are unique
  - the number of bank conflicts is minimized
53. Finding the Bank Number and Address within a Bank
- Solution: use the following relations to determine the bank number for x, B(x), and the address of x within the bank, A(x):
  - B(x) = x MOD Nb
  - A(x) = x MOD Wb
- Choose Nb and Wb to be co-prime, i.e., no prime number is a factor of both Nb and Wb (this condition is satisfied if we choose Nb to be a prime number that is equal to an integer power of two minus 1)
- We can then use the Chinese Remainder Theorem to show that the pair B(x), A(x) is always unique
54. Fast Bank Number
- Chinese Remainder Theorem: as long as two sets of integers ai and bi follow these rules:
  - x = bi mod ai, with 0 <= bi < ai and 0 <= x < a0 x a1 x ...,
  - and ai and aj are co-prime whenever i != j,
  then the integer x has only one solution (an unambiguous mapping)
- bank number = b0, number of banks = a0
- address within bank = b1, number of words in bank = a1
- N-word address 0 to N-1, a prime number of banks, words per bank a power of 2
- Example: 3 banks (Nb = 3) and 8 words per bank (Wb = 8)
                     Seq. Interleaved       Modulo Interleaved
  Bank Number:        0    1    2            0    1    2
  Address
  within Bank   0     0    1    2            0   16    8
                1     3    4    5            9    1   17
                2     6    7    8           18   10    2
                3     9   10   11            3   19   11
                4    12   13   14           12    4   20
                5    15   16   17           21   13    5
                6    18   19   20            6   22   14
                7    21   22   23           15    7   23
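A small C sketch that reproduces the table above for Nb = 3 banks and Wb = 8 words per bank. Note that both schemes compute the same bank number; modulo interleaving gets the address within the bank from x mod 8 (the low 3 bits) instead of a divide:

    #include <stdio.h>

    /* Nb = 3 (prime, = 2^2 - 1) and Wb = 8, as on the slide above.
     * Because 3 and 8 are co-prime, the (bank, offset) pair is unique
     * for every word address 0..23 (Chinese Remainder Theorem).        */
    #define NB 3
    #define WB 8

    int main(void) {
        printf("addr  seq(bank,off)  modulo(bank,off)\n");
        for (int x = 0; x < NB * WB; x++) {
            int seq_bank = x % NB, seq_off = x / NB;  /* sequential interleaving */
            int mod_bank = x % NB, mod_off = x % WB;  /* modulo interleaving     */
            printf("%4d     (%d,%d)           (%d,%d)\n",
                   x, seq_bank, seq_off, mod_bank, mod_off);
        }
        return 0;
    }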
55. Fast Memory Systems: DRAM-specific
- Multiple CAS accesses: several names ("page mode")
  - Extended Data Out (EDO): 30% faster in page mode
- Newer DRAMs to address the gap; what will they cost, and will they survive?
  - RAMBUS: startup company; reinvented the DRAM interface
    - Each chip is a module vs. a slice of memory
    - Short bus between the CPU and the chips
    - Does its own refresh
    - Variable amount of data returned
    - 1 byte / 2 ns (500 MB/s per chip)
  - Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66 - 150 MHz)
  - Intel claims Direct RAMBUS (16 bits wide) is the future of PC memory
- Niche memory or main memory?
  - e.g., Video RAM for frame buffers: DRAM + fast serial output
56. Fast Page Mode Operation
- Regular DRAM organization:
  - N rows x N columns x M bits
  - Read & write M bits at a time
  - Each M-bit access requires a RAS/CAS cycle
- Fast Page Mode DRAM
  - An N x M "SRAM" to save a row
- After a row is read into the register:
  - Only CAS is needed to access other M-bit blocks on that row
  - RAS_L remains asserted while CAS_L is toggled
- [Figure: DRAM array of N rows with an N x M SRAM row buffer; the row address selects a row, and successive column addresses read M-bit outputs from the buffer]
57. Something New: Structure of a Tunneling Magnetic Junction
- Tunneling Magnetic Junction RAM (TMJ-RAM)
  - Speed of SRAM, density of DRAM, non-volatile (no refresh)
  - "Spintronics": combination of quantum spin and electronics
  - Same technology used in high-density disk drives
58. MEMS-based Storage
- A magnetic "sled" floats on an array of read/write heads
  - Approx 250 Gbit/in^2
  - Data rates: IBM: 250 MB/s with 1000 heads; CMU: 3.1 MB/s with 400 heads
- Electrostatic actuators move the media around to align it with the heads
  - Sweep the sled 50 µm
- Capacity estimated to be in the 1-10 GB range in 10 cm^2
- See Ganger et al., http://www.lcs.ece.cmu.edu/research/MEMS
59. Big Storage (such as DRAM/Disk): Potential for Errors!
- Motivation:
  - DRAM is dense => signals are easily disturbed
  - High capacity => higher probability of failure
- Approach: redundancy
  - Add extra information so that we can recover from errors
  - Can we do better than just creating complete copies?
- Block codes: data coded in blocks
  - k data bits coded into n encoded bits
  - Measure of overhead: rate of the code, k/n
  - Often called an (n, k) code
  - Consider data as vectors in GF(2), i.e., vectors of bits
  - The code space is the set of all 2^n vectors; the data space is the set of 2^k vectors
  - Encoding function: C = f(d)
  - Decoding function: d = f(C)
  - Not all possible code vectors, C, are valid!
60. Need for Error Correction!
- Motivation:
  - Failures per unit time are proportional to the number of bits!
  - As DRAM cells shrink, they become more vulnerable
- Went through a period in which the failure rate was low enough without error correction that people didn't do correction
  - DRAM banks are too large now
  - Servers have always had corrected memory systems
- Basic idea: add redundancy through parity bits
  - Common configuration: random error correction
    - SEC-DED (single error correct, double error detect)
    - One example: 64 data bits + 8 parity bits (11% overhead)
  - Really want to handle failures of physical components as well
    - Organization is multiple DRAMs per DIMM, multiple DIMMs
    - Want to recover from a failed DRAM and a failed DIMM!
    - "Chip kill": handle failures of the width of a single DRAM chip
61. General Idea: Code Vector Space
- Not every vector in the code space is valid
- Hamming Distance (d):
  - Minimum number of bit flips to turn one code word into another
- Number of errors that we can detect: (d - 1)
- Number of errors that we can fix: floor((d - 1)/2)
- (A small sketch of this arithmetic follows below)
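A tiny C sketch of the Hamming-distance arithmetic above. The two code words are made-up values at distance 4, chosen only to illustrate the detect/correct bounds; a real code's d is the minimum distance over all pairs of valid code words:

    #include <stdint.h>
    #include <stdio.h>

    /* Hamming distance between two code words = number of differing bits. */
    static int hamming_distance(uint64_t a, uint64_t b) {
        uint64_t diff = a ^ b;
        int d = 0;
        while (diff) { d += diff & 1; diff >>= 1; }
        return d;
    }

    int main(void) {
        /* Two hypothetical valid code words at distance 4. */
        uint64_t c1 = 0x0F, c2 = 0x33;
        int d = hamming_distance(c1, c2);          /* d = 4 here */
        printf("distance d = %d\n", d);
        printf("can detect up to %d bit errors\n", d - 1);
        printf("can correct up to %d bit errors\n", (d - 1) / 2);
        return 0;
    }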
62. Conclusion
- The memory wall inspires optimizations, since so much performance is lost there
  - Reducing hit time: small and simple caches, way prediction, trace caches
  - Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
  - Reducing miss penalty: critical word first, merging write buffers
  - Reducing miss rate: compiler optimizations
  - Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching
- "Auto-tuners": search replacing static compilation to explore the optimization space?