Title: Memory Hierarchy Design
1. Memory Hierarchy Design
- Motivated by a combination of the programmer's desire for unlimited fast memory and economic considerations, and based on
  - the principle of locality, and
  - the cost/performance ratio of memory technologies (fast => small, large => slow),
- to achieve a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
- Hierarchy: CPU/register file (RF) -> cache (C) -> main memory (MM) -> disk memory/I/O devices (DM)
- Speed in descending order: RF > C > MM > DM
- Space in ascending order: RF < C < MM < DM
2. Memory Hierarchy Design
- The gaps in speed and space between the different levels keep widening

  Level / Name                1 / RF                  2 / C                            3 / MM              4 / DM
  Typical size                < 1 KB                  < 16 MB                          < 16 GB             > 100 GB
  Implementation technology   Custom memory with      On-chip or off-chip CMOS SRAM    CMOS DRAM           Magnetic disk
                              multiple ports, CMOS
  Access time (ns)            0.25-0.5                0.5-25                           80-250              5,000,000
  Bandwidth (MB/s)            20,000-100,000          5,000-10,000                     1,000-5,000         20-150
  Managed by                  Compiler                Hardware                         Operating system    OS/operator
  Backed by                   Cache                   Main memory                      Disk                CD or tape
3. Memory Hierarchy Design
- Cache performance review
  - Memory stall cycles = Number of misses × Miss penalty
                        = IC × Misses per instruction × Miss penalty
                        = IC × Memory accesses per instruction × Miss rate × Miss penalty
    where "memory accesses per instruction" is abbreviated MAPI (a worked example appears at the end of this slide)
- Four Fundamental Memory Hierarchy Design Issues
  - Block placement issue: where can a block, the atomic memory unit in cache-memory transactions, be placed in the upper level?
  - Block identification issue: how is a block found if it is in the upper level?
  - Block replacement issue: which block should be replaced on a miss?
  - Write strategy issue: what happens on a write?
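- A quick worked check of the stall-cycle formula above (all numbers are illustrative, not from the slides): with IC = 10^9 instructions, 1.5 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty,

    Memory stall cycles = 10^9 × 1.5 × 0.02 × 100 = 3 × 10^9 cycles

  i.e., 3 stall cycles per instruction on average.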
4. Memory Hierarchy Design
- Placement: three approaches
  - Fully associative: any block in the main memory can be placed in any block frame. It is flexible but expensive due to the associativity
  - Direct mapping: each block in memory is placed in a fixed block frame with the following mapping function: (Block address) MOD (Number of blocks in cache)
  - Set associative: a compromise between fully associative and direct mapping. The cache is divided into sets of block frames, and each block from the memory is first mapped to a fixed set, within which the block can be placed in any block frame. Mapping to a set follows the function, called bit selection: (Block address) MOD (Number of sets in cache)
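- A small worked instance of the two mapping functions (the cache geometry is assumed for illustration): in a cache with 8 block frames, block address 12 is placed in frame 12 MOD 8 = 4 under direct mapping; if the same cache is organized as 2-way set associative (4 sets), block 12 maps to set 12 MOD 4 = 0 and may occupy either frame of that set.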
5. Memory Hierarchy Design
- Identification
  - Each block frame in the cache has an address tag indicating the block's address in the memory
  - All possible tags are searched in parallel
  - A valid bit is attached to the tag to indicate whether the block contains valid information or not
  - An address A for a datum from the CPU is divided into a block address field and a block offset field
    - block address = A / (block size)
    - block offset = A MOD (block size)
  - The block address is further divided into tag and index
    - the index indicates the set in which the block may reside
    - the tag is compared to indicate a hit or a miss (a small address-decomposition sketch follows this slide)
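- The decomposition above can be sketched in C as follows; the geometry (64-byte blocks, 512 sets) is assumed for illustration rather than taken from a particular cache in these slides:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 64-byte blocks -> 6 offset bits; 512 sets -> 9 index bits. */
    #define BLOCK_OFFSET_BITS 6
    #define INDEX_BITS        9

    int main(void) {
        uint64_t addr   = 0x12345678;                              /* example address                 */
        uint64_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);  /* A MOD block size                */
        uint64_t index  = (addr >> BLOCK_OFFSET_BITS)
                          & ((1u << INDEX_BITS) - 1);              /* selects the set                 */
        uint64_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS);/* compared against stored tags    */
        printf("tag=0x%llx index=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)index, (unsigned long long)offset);
        return 0;
    }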
6. Memory Hierarchy Design
- Replacement on a cache miss
  - The more choices for replacement, the more expensive the hardware => direct mapping is the simplest
  - Random vs. least-recently used (LRU): the former gives uniform allocation and is simple to build, while the latter can take advantage of temporal locality but can be expensive to implement (why?). First in, first out (FIFO) approximates LRU and is simpler than LRU
  - Data cache misses per 1000 instructions:

                     Two-way                       Four-way                      Eight-way
    Size        LRU     Random   FIFO         LRU     Random   FIFO         LRU     Random   FIFO
    16 KB       114.1   117.3    115.5        111.7   115.1    113.3        109.0   111.8    110.4
    64 KB       103.4   104.3    103.9        102.4   102.3    103.1        99.7    100.5    100.3
    256 KB      92.2    92.1     92.5         92.1    92.1     92.5         92.1    92.1     92.5
7. Memory Hierarchy Design
- Write strategies
  - Most cache accesses are reads: 10 stores and 37 loads per 100 instructions => only about 7% (10/147) of all memory accesses, counting instruction fetches, are writes
  - Optimize reads to make the common case fast, observing that the CPU does not have to wait for writes but must wait for reads; fortunately, read is easy: in direct mapping, reading and tag comparison can be done in parallel (what about associative mapping?), but write is hard
    - cannot overlap tag reading and block writing (writes are destructive)
    - the CPU specifies the write size: only 1-8 bytes. Thus write strategies often distinguish cache designs
  - On a write hit
    - write through (or store through)
      - ensures consistency at the cost of memory and bus bandwidth
      - write stalls may be alleviated by using write buffers
    - write back (store in)
      - minimizes memory and bus traffic at the cost of weakened consistency
      - uses a dirty bit to indicate modification
      - read misses may result in writes (why?)
  - On a write miss
    - write allocate (fetch on write)
    - no-write allocate (write around)
8. Memory Hierarchy Design
- An Example: The Alpha 21264 Data Cache
  - Cache size = 64 KB, block size = 64 B, two-way set associative, write-back, write allocate on a write miss.
  - What is the index size?
    Number of sets = 64K / (64 × 2) = 2^16 / (2^6 × 2^1) = 2^9  =>  the index is 9 bits
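- More generally, the same calculation is (standard relationship, consistent with the worked example above):

    Number of sets = Cache size / (Block size × Associativity)
    Index bits     = log2(Number of sets)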
9. Memory Hierarchy Design
- Cache Performance
  - Memory access time is an indirect measure of performance; it is not a substitute for execution time
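  - For reference, the average memory access time and CPU time expressions that the following slides build on are the standard ones:

      AMAT     = Hit time + Miss rate × Miss penalty
      CPU time = IC × (CPI_execution + Memory stall cycles per instruction) × Clock cycle time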
10. Memory Hierarchy Design
- Example 1: How much does the cache help in performance?
11. Memory Hierarchy Design
- Example 2: What's the relationship between AMAT and CPU time?
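- A purely illustrative answer to both questions, with all parameter values assumed (CPI_execution = 1, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle miss penalty, 1-cycle hit time):

    CPI with cache    = 1 + 1.5 × 0.02 × 100 = 4
    CPI with no cache = 1 + 1.5 × 100        = 151   (every access goes to main memory)
    => the cache speeds the machine up by roughly 151 / 4 ≈ 38×
    AMAT              = 1 + 0.02 × 100       = 3 cycles

  A lower AMAT generally lowers CPU time, but not proportionally, since CPU time also depends on CPI_execution and on how much of the miss latency can be overlapped with execution.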
12. Memory Hierarchy Design
- Improving Cache Performance
  - The average memory access time can be improved by reducing any of the three parameters above
    - R1: reducing miss rate
    - R2: reducing miss penalty
    - R3: reducing hit time
  - Four categories of cache organizations help reduce these parameters
    - Organizations that help reduce miss rate
      - larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations
    - Organizations that help reduce miss penalty
      - multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
    - Organizations that help reduce miss penalty or miss rate via parallelism
      - non-blocking caches, hardware prefetching, and compiler prefetching
    - Organizations that help reduce hit time
      - small and simple caches, avoiding address translation, pipelined cache access, and trace caches
13. Memory Hierarchy Design
- Reducing Miss Rate
  - There are three kinds of cache misses, classified by cause:
    - Compulsory: the very first access to a block cannot be a hit, since the block must first be brought in from the main memory. Also called cold-start misses
    - Capacity: lack of space in the cache to hold all blocks needed for the execution. Capacity misses occur because of blocks being discarded and later retrieved
    - Conflict: due to mapping that confines blocks to a restricted area of the cache (e.g., direct mapping, set associative); also called collision misses or interference misses
  - While the 3-C characterization gives insight into causes, it is at times too simplistic (and the categories are interdependent). For example, it ignores replacement policies.
16. Memory Hierarchy Design
- First Miss Rate Reduction Technique: Larger Block Size
  - Takes advantage of spatial locality => reduces compulsory misses
  - Increases miss penalty (it takes longer to fetch a block)
  - Increases conflict misses and/or capacity misses
  - Must strike a delicate balance among miss penalty, miss rate, and AMAT in finding an appropriate block size
17. Memory Hierarchy Design
- First Miss Rate Reduction Technique: Larger Block Size
  - Example: find the block size that is optimal in terms of AMAT, given that the miss penalty is 40 cycles of overhead plus 2 cycles per 16 bytes, and the miss rates of the table below.
  - Solution: AMAT = HT + MR × MP = HT + MR × (40 + block size × 2 / 16)
  - High latency and high bandwidth encourage large block sizes
  - Low latency and low bandwidth encourage small block sizes
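- A small C sketch of this trade-off; the miss penalty model (40 + 2 × block size / 16 cycles) is from the slide, while the hit time and miss rates below are made-up placeholders rather than the values from the slide's table:

    #include <stdio.h>

    int main(void) {
        int    block_sizes[] = {16, 32, 64, 128, 256};
        /* Hypothetical miss rates for each block size (illustrative only). */
        double miss_rates[]  = {0.057, 0.042, 0.035, 0.031, 0.030};
        double hit_time      = 1.0;   /* cycles, assumed */

        for (int i = 0; i < 5; i++) {
            double miss_penalty = 40.0 + 2.0 * block_sizes[i] / 16.0;  /* slide's penalty model */
            double amat = hit_time + miss_rates[i] * miss_penalty;
            printf("block=%3d B  MP=%5.1f  AMAT=%.3f cycles\n",
                   block_sizes[i], miss_penalty, amat);
        }
        return 0;
    }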
18. Memory Hierarchy Design
- Second Miss Rate Reduction Technique: Larger Caches
  - An obvious way to reduce capacity misses
  - Drawbacks: higher overhead in terms of hit time, and higher cost
  - Popular for off-chip caches (2nd- and 3rd-level caches)
- Third Miss Rate Reduction Technique: Higher Associativity
  - Reduces the miss rate
  - Rules of thumb:
    - 8-way set associativity is almost as effective as full associativity
    - the miss rate of a 1-way cache of size N is about equal to the miss rate of a 2-way cache of size N/2
  - The higher the associativity, the longer the hit time (why?)
  - Higher miss rates reward higher associativity.
19. Memory Hierarchy Design
- Fourth Miss Rate Reduction Technique: Way Prediction and Pseudoassociative Caches
  - Way prediction helps select one block among those in a set, thus requiring only one tag comparison (if the prediction hits)
    - preserves the advantages of direct mapping (why?)
    - in case of a misprediction, the other block(s) are checked
  - Pseudoassociative (also called column associative) caches
    - operate exactly as direct-mapped caches on a hit, thus again preserving the advantages of direct mapping
    - in case of a miss, another block is checked (as if in a set-associative cache), by simply inverting the most significant bit of the index field to find the other block in the pseudoset (a small sketch follows this slide)
    - real hit time < pseudo-hit time
    - too many pseudo hits would defeat the purpose
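- A minimal sketch of the pseudoset lookup described above (the index width is assumed):

    #include <stdint.h>
    #include <stdio.h>

    #define INDEX_BITS 9   /* assumed index width */

    /* Invert the most significant index bit to find the partner frame in the pseudoset. */
    static uint32_t pseudo_set(uint32_t index) {
        return index ^ (1u << (INDEX_BITS - 1));
    }

    int main(void) {
        uint32_t index = 0x005;                                    /* example primary index     */
        printf("primary index   = 0x%03x\n", index);
        printf("pseudoset index = 0x%03x\n", pseudo_set(index));   /* checked only on a miss    */
        return 0;
    }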
20. Memory Hierarchy Design
- Fifth Miss Rate Reduction Technique: Compiler Optimizations
23. Memory Hierarchy Design
- Fifth Miss Rate Reduction Technique: Compiler Optimizations
  - Blocking: improves temporal and spatial locality
    - multiple arrays are accessed in both ways (i.e., row-major and column-major), namely orthogonal accesses, which cannot be helped by the earlier methods
    - the idea is to concentrate on submatrices, or blocks
    - all N×N elements of Y and Z are accessed N times and each element of X is accessed once. Thus there are N^3 operations and 2N^3 + N^2 reads! Capacity misses are a function of N and the cache size in this case.
24. Memory Hierarchy Design
- Fifth Miss Rate Reduction Technique: Compiler Optimizations
  - To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on submatrices of size B × B, where B is called the blocking factor.
  - The total number of memory words accessed becomes 2N^3/B + N^2
  - Blocking exploits a combination of spatial (Y) and temporal (Z) locality (a code sketch follows this slide).
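- A sketch of the blocked X = Y × Z loop nest described above, following the usual textbook formulation; the dimension N and blocking factor B are illustrative, and x is assumed to be zero-initialized by the caller:

    #define N 512     /* matrix dimension (illustrative) */
    #define B 32      /* blocking factor (illustrative)  */

    /* Blocked matrix multiply X = Y * Z: a B-wide strip of Y and a B x B
     * submatrix of Z are reused while they are still resident in the cache. */
    void blocked_matmul(double x[N][N], double y[N][N], double z[N][N]) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B && j < N; j++) {
                        double r = 0.0;
                        for (int k = kk; k < kk + B && k < N; k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] += r;   /* accumulate partial products across kk blocks */
                    }
    }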
25. Memory Hierarchy Design
- First Miss Penalty Reduction Technique: Multilevel Caches
  - To keep up with the widening gap between the CPU and main memory, try to
    - make the cache faster, and
    - make the cache larger
  - by adding another, larger but slower cache between the first-level cache and the main memory.
26. Memory Hierarchy Design
- First Miss Penalty Reduction Technique: Multilevel Caches
  - Local miss rate vs. global miss rate
    - Local miss rate: the misses in a cache divided by the number of memory accesses that reach that cache (e.g., Miss rate_L2 for the second level)
    - Global miss rate: the misses in a cache divided by the total number of memory accesses generated by the CPU (e.g., Miss rate_L1 × Miss rate_L2 for the second level)
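  - The two-level average memory access time then takes the standard form:

      AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)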
27. Memory Hierarchy Design
- Second Miss Penalty Reduction Technique: Critical Word First and Early Restart
  - The CPU needs just one word of the block at a time
    - critical word first: fetch the required word first, and
    - early restart: as soon as the required word arrives, send it to the CPU
- Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Write Misses
  - Serve reads before outstanding writes have completed
    - while write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory
    - instead of waiting for the write buffer to drain before processing a read miss, the write buffer is checked for content that might satisfy the missing read (see the sketch after this slide)
    - in a write-back scheme, the dirty copy being replaced is first written to the write buffer instead of to memory, thus improving performance.
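- A minimal sketch of the read-miss check on the write buffer described above; the buffer organization and sizes are assumed for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define WB_ENTRIES 4
    #define BLOCK_SIZE 64

    struct wb_entry { bool valid; uint64_t block_addr; uint8_t data[BLOCK_SIZE]; };
    static struct wb_entry write_buffer[WB_ENTRIES];

    /* Returns true (and copies the data) if the read miss is satisfied from the buffer. */
    bool read_miss_check_write_buffer(uint64_t block_addr, uint8_t out[BLOCK_SIZE]) {
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr) {
                memcpy(out, write_buffer[i].data, BLOCK_SIZE);
                return true;   /* read serviced without waiting for the buffer to drain */
            }
        }
        return false;          /* not in the buffer: go to the next memory level */
    }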
28. Memory Hierarchy Design
- Fourth Miss Penalty Reduction Technique: Merging Write Buffer
  - Improves the efficiency of write buffers, which are used by both write-through and write-back caches
  - Multiple single-word writes to the same block are combined into a single write-buffer entry that would otherwise hold a multi-word write
  - Reduces stalls due to the write buffer being full
29. Memory Hierarchy Design
- Fifth Miss Penalty Reduction Technique: Victim Cache
  - Victim caches attempt to avoid the miss penalty on a miss by
    - adding a small fully associative cache that holds recently discarded blocks (victims)
  - Proven to be effective, especially for small direct-mapped (1-way) caches; e.g., a 4-entry victim cache can remove 20% or more of the conflict misses!
30. Memory Hierarchy Design
- Reducing Cache Miss Penalty or Miss Rate via Parallelism
  - Nonblocking caches (lockup-free caches)
  - Hardware prefetching of instructions and data
31. Memory Hierarchy Design
- Reducing Cache Miss Penalty or Miss Rate via Parallelism
  - Compiler-controlled prefetching: the compiler inserts prefetch instructions
32. Memory Hierarchy Design
- Reducing Cache Miss Penalty or Miss Rate via Parallelism
  - Compiler-Controlled Prefetching: An Example

      for (i = 0; i < 3; i = i + 1)
          for (j = 0; j < 100; j = j + 1)
              a[i][j] = b[j][0] * b[j+1][0];

  - 16-byte blocks, 8-KB cache, 1-way, write back, 8-byte elements. What kind of locality, if any, exists for a and b?
    - a has 3 rows and 100 columns: spatial locality; even-indexed elements miss and odd-indexed elements hit, leading to 3 × 100 / 2 = 150 misses
    - b has 101 rows and 3 columns: no spatial locality, but there is temporal locality: the same element is used in the i-th and (i+1)-st iterations, and the same element is accessed in every i iteration. This gives 100 misses for b[j+1][0] when i = 0 plus 1 miss for b[0][0] when j = 0, for a total of 101 misses
  - Assuming a large miss penalty (50 cycles), at least 7 iterations ahead must be prefetched. Splitting the loop into two, we have (next slide)
33. Memory Hierarchy Design
- Reducing Cache Miss Penalty or Miss Rate via Parallelism
  - Compiler-Controlled Prefetching: An Example (continued)

      for (j = 0; j < 100; j = j + 1) {
          prefetch(b[j+7][0]);
          prefetch(a[0][j+7]);
          a[0][j] = b[j][0] * b[j+1][0];
      }
      for (i = 1; i < 3; i = i + 1)
          for (j = 0; j < 100; j = j + 1) {
              prefetch(a[i][j+7]);
              a[i][j] = b[j][0] * b[j+1][0];
          }

  - Assuming that each iteration of the pre-split loop consumes 7 cycles and there are no conflict or capacity misses, the original loop consumes a total of 7 × 300 + 251 × 50 = 14,650 cycles (total iteration cycles plus total cache-miss cycles), whereas the split loops consume a total of (1 + 1 + 7) × 100 + (4 + 7) × 50 + (1 + 7) × 200 + (4 + 4) × 50 = 3,450 cycles
34. Memory Hierarchy Design
- Reducing Cache Miss Penalty or Miss Rate via Parallelism
  - Compiler-Controlled Prefetching: An Example (continued)
    - the first loop consumes 9 cycles per iteration (due to the two prefetch instructions)
    - the second loop consumes 8 cycles per iteration (due to the single prefetch instruction)
    - during the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses
    - during the first 7 iterations of the second loop (for i = 1 and i = 2), array a incurs 4 cache misses each
    - array b does not incur any cache miss in the second split loop!
35. Memory Hierarchy Design
- First Hit Time Reduction Technique: Small and Simple Caches
  - smaller is faster
    - a small index, less address translation time
    - a small cache can fit on the same chip as the processor
  - low associativity: in addition to a simpler/shorter tag check, a 1-way (direct-mapped) cache allows overlapping the tag check with the transmission of data, which is not possible with any higher associativity!
- Second Hit Time Reduction Technique: Avoid Address Translation During Indexing
  - Make the common case fast
    - use virtual addresses for the cache, because most memory accesses (more than 90%) are cache hits, resulting in a virtual cache
36. Memory Hierarchy Design
- Second Hit Time Reduction Technique: Avoid Address Translation During Indexing
  - Make the common case fast
    - there are at least three important performance aspects that directly relate to virtual-to-physical translation:
      - improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time
      - for a physical cache, the TLB access must occur before the cache access, extending the cache access time
      - two line addresses (e.g., an I-line and a D-line address) may be independent of each other in the virtual address space yet collide in the real address space, when they land in pages whose lower page-address bits (and upper cache-address bits) are identical
  - Problems with virtual caches
    - page-level protection must be enforced no matter what during address translation (solution: copy the protection information from the TLB on a miss and hold it in a field for future virtual indexing/tagging)
    - when a process is switched in/out, the entire cache has to be flushed, because the same virtual addresses now map to different physical addresses, i.e., the problem of context switching (solution: a process-identifier tag, PID)
37. Memory Hierarchy Design
- Second Hit Time Reduction Technique: Avoid Address Translation During Indexing
  - Problems with virtual caches (continued)
    - different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases
      - HW solution: guarantee every cache block a unique physical address
      - SW solution: force aliases to share some address bits (e.g., page coloring)
    - alternative: virtually indexed, physically tagged caches
- Third Hit Time Reduction Technique: Pipelined Cache Writes
  - the idea is to reduce the clock cycle time (CCT); the increased number of stages increases instruction throughput
- Fourth Hit Time Reduction Technique: Trace Caches
  - Find a dynamic sequence of instructions, including taken branches, to load into a cache block
  - Put traces of the executed instructions into cache blocks, as determined by the CPU
  - Branch prediction is folded into the cache and must be validated along with the addresses to have a valid fetch
  - Disadvantage: the same instructions may be stored multiple times
39. Memory Hierarchy Design
- Main Memory and Organizations for Improving Performance
  - Wider main memory bus
    - the cache miss penalty decreases proportionally
    - costs:
      - a wider bus (× n) and wider multiplexers (× n),
      - memory expandability only in units that are n times larger, and
      - more expensive error correction
  - Simple interleaved memory
    - potential parallelism from multiple DRAM banks
    - send the address and access multiple banks in parallel, but transmit the data sequentially (4 + 24 + 4 × 4 = 44 cycles => 16/44 ≈ 0.4 bytes/cycle; a comparison of the organizations follows this slide)
  - Independent memory banks
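- Assuming the timing implied by the figures above (4 cycles to send an address, 24 cycles of DRAM access, 4 cycles to transfer one 4-byte word), a 16-byte (four-word) miss compares roughly as follows across the three organizations:

    One-word-wide memory:    4 × (4 + 24 + 4) = 128 cycles  =>  16/128 ≈ 0.13 bytes/cycle
    Four-way interleaved:    4 + 24 + 4 × 4   =  44 cycles  =>  16/44  ≈ 0.4  bytes/cycle
    Four-word-wide memory:   4 + 24 + 4       =  32 cycles  =>  16/32  = 0.5  bytes/cycle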
45. Memory Hierarchy Design
- Virtual Memory
  - Fast address translation, an example: the Alpha 21264 data TLB
    - the ASN is used as a PID for virtual caches
    - the TLB is not flushed on a context switch, but only when ASNs are recycled
    - fully associative placement
46. Memory Hierarchy Design
- Virtual Memory
  - What is the optimal page size? It depends:
    - page table size is proportional to 1 / (page size)
    - a large page size makes a virtual cache possible (avoiding the alias problem), thus reducing cache hit time
    - transfer of larger pages (over the network) is more efficient
    - a small TLB favors larger pages
    - main drawbacks of a large page size:
      - internal fragmentation: wasted storage
      - process start-up time and context-switching overhead grow
47. Memory Hierarchy Design
- Summarizing Virtual Memory and Caches
  - A hypothetical memory hierarchy going from a virtual address to an L2 cache access
48. Memory Hierarchy Design
- Protection and Examples of Virtual Memory
  - The invention of multiprogramming led to the need to share computer resources such as CPU, memory, and I/O among multiple programs, whose instantiations are called processes
  - Time-sharing of computer resources by multiple processes requires that processes take turns using such resources, and designers of operating systems and computers must ensure that the switching among different processes, also called context switching, is done correctly
    - the computer designer must ensure that the CPU portion of the process state can be saved and restored
    - the operating system designer must guarantee that processes do not interfere with each other's computations
  - Protecting processes
    - Base and bound: each process falls in a pre-defined portion of the address space, that is, an address is valid if Base <= Address <= Bound, where the OS defines and keeps the values of Base and Bound in two registers (a small sketch follows this slide).
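- A minimal sketch of the base-and-bound check described above; the register values are illustrative and would be set by the OS on a context switch:

    #include <stdbool.h>
    #include <stdint.h>

    static uint64_t base_reg  = 0x0010000;   /* assumed values loaded by the OS */
    static uint64_t bound_reg = 0x001FFFF;

    /* An address is valid only if Base <= Address <= Bound. */
    bool address_is_valid(uint64_t addr) {
        return base_reg <= addr && addr <= bound_reg;
    }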
49. Memory Hierarchy Design
- Protection and Examples of Virtual Memory
  - The computer designer's responsibilities in helping the OS designer protect processes from each other:
    - providing two modes to distinguish a user process from a kernel process (or equivalently, a supervisor or executive process)
    - providing a portion of the CPU state, including the base/bound registers, the user/kernel mode bit(s), and the exception enable/disable bit, that user processes can use but cannot write, and
    - providing mechanisms by which the CPU can switch between user mode and supervisor mode
  - While base-and-bound constitutes the minimal protection system, virtual memory offers a more fine-grained alternative to this simple model
    - address translation provides an opportunity to check for possible violations: the read/write and user/kernel signals from the CPU are compared against the permission flags marked on individual pages by the virtual memory system (or OS) to detect stray memory accesses
    - depending on the designer's level of concern, protection can be either relaxed or escalated; in escalated protection, multiple levels of access permission can be used, much like the military classification system.
50. Memory Hierarchy Design
- Protection and Examples of Virtual Memory
  - A Paged Virtual Memory Example: the Alpha Memory Management and the 21264 TLB (one for instructions and one for data)
    - a combination of segmentation and paging, with 48-bit virtual addresses, while the 64-bit address space is divided into three segments: seg0 (bits 63-47 = 0...00), kseg (bits 63-46 = 0...10), and seg1 (bits 63-46 = 1...11)
    - advantages: segmentation divides the address space and conserves page table space, while paging provides virtual memory, relocation, and protection
    - even with segmentation, the size of the page tables for the 64-bit address space is alarming. Alpha therefore uses a three-level hierarchical page table, with each page table contained in one page
51. Memory Hierarchy Design
- Protection and Examples of Virtual Memory
  - A Paged Virtual Memory Example: the Alpha Memory Management and the 21264 TLB
    - a PTE is 64 bits long: the first 32 bits contain the physical page number, and the other half includes the following five protection fields:
      - Valid: whether the page number is valid for address translation
      - User read enable: allows user programs to read data within this page
      - Kernel read enable: allows kernel programs to read data within this page
      - User write enable: allows user programs to write data within this page
      - Kernel write enable: allows kernel programs to write data within this page
    - the current Alpha design has 8-KB pages, thus allowing 1024 PTEs in each page table. The three page-level fields and the page offset account for 10 + 10 + 10 + 13 = 43 bits of the 64-bit virtual address. The 21 bits to the left of the level-1 field are all 0s for seg0 and all 1s for seg1.
    - the maximum virtual and physical address sizes are tied to the page size, and Alpha has provisions for future growth: 16-KB, 32-KB, and 64-KB page sizes.
    - the following table shows memory hierarchy parameters of the Alpha 21264 TLB:

      Parameter          Description
      Block size         1 PTE (8 bytes)
      Hit time           1 clock cycle
      Miss penalty       20 clock cycles (average)
      TLB size           128 PTEs per TLB, each of which can map 1, 8, 64, or 512 pages
      Block selection    Round-robin
      Write strategy     (not applicable)
      Block placement    Fully associative
52. Memory Hierarchy Design
- Protection and Examples of Virtual Memory
  - A Segmented Virtual Memory Example: Protection in the Intel Pentium
    - the Pentium has four protection levels, with the innermost level (0) corresponding to the Alpha's kernel mode and the outermost level (3) corresponding to the Alpha's user mode. Separate stacks are used for each level to avoid security breaches between levels.
    - a user can call an OS routine and pass parameters to it while retaining full protection
    - the OS can maintain the protection level of the called routine for the parameters that are passed to it
    - a potential loophole in protection is prevented by not allowing the user process to ask the OS to access something indirectly that the process would not have been able to access itself (such security loopholes are called Trojan horses).
    - bounds checking and memory mapping in the Pentium use a descriptor table (DT, which plays the role of the page tables in the Alpha). The equivalent of a PTE in the DT is a segment descriptor containing the following fields:
      - Present bit: equivalent to the PTE valid bit, used to indicate that this is a valid translation
      - Base field: equivalent to a page frame address, containing the physical address of the first byte of the segment
      - Access bit: like the reference bit or use bit in some architectures, helpful for replacement algorithms
      - Attributes field: specifies the valid operations and protection levels for operations that use this segment
      - Limit field: not found in paged systems; establishes the upper bound of valid offsets for this segment.
53. Memory Hierarchy Design
- Crosscutting Issues: The Design of Memory Hierarchies
  - Superscalar CPUs and the number of ports to the cache
    - the cache must provide sufficient peak bandwidth to benefit from multiple issue. Some processors increase the complexity of instruction fetch by allowing the instructions to be issued to be found on any boundary instead of, say, multiples of 4 words.
  - Speculative execution and the memory system
    - speculative and conditional instructions generate exceptions (by generating invalid addresses) that would otherwise not occur, which in turn can overwhelm the benefits of speculation with exception-handling overhead. Such CPUs must be matched with non-blocking caches and should speculate only on L1 misses (the L2 miss penalty is too large to hide).
  - Combining the instruction cache with instruction fetch and decode mechanisms
    - increasing demand for ILP and clock rate has led to the merging of the first part of instruction execution with the instruction cache, by incorporating a trace cache (which combines branch prediction with instruction fetch) and storing the internal RISC operations in the trace cache (e.g., the Pentium 4's NetBurst microarchitecture). A cache hit in the merged cache saves a portion of the instruction execution cycles.
  - Embedded computer caches and real-time performance
    - in real-time applications, variation in performance matters much more than average performance. Thus caches, which offer an average performance enhancement, have to be used carefully. Instruction caches are often used because of the high predictability of instruction accesses, whereas data caches are locked down, forcing them to act as small scratchpad memories under program control.
  - Embedded computer caches and power
    - it is much more power-efficient to access on-chip memory than off-chip memory (which requires driving the pins and buses and activating external memory chips, etc.). Other techniques, such as way prediction, can be used to save power (by powering only half of a two-way set-associative cache).
  - I/O and consistency of cached data
    - the cache coherence problem must be addressed when I/O devices also share cached data.
54. Memory Hierarchy Design
- An Example of the Cache Coherence Problem
55. Memory Hierarchy Design
- Putting It All Together: the Alpha 21264 Memory Hierarchy
  - The instruction cache is virtually indexed and virtually tagged; the data cache is virtually indexed but physically tagged
  - Operation:
    - the chip loads instructions serially from an external PROM and loads configuration information for the L2 cache
    - it executes preloaded code in PAL mode to initialize, e.g., the TLB
    - once the OS is ready, it sets the PC to an appropriate address in seg0
    - 9 (index) + 1 (way prediction) + 2 (4-instruction group) = 12 address bits are sent to the I-cache; 48 - 9 - 6 = 33 bits form the virtual tag
    - way prediction and line prediction (11 bits for the next 16-byte group, set on a miss and updated by branch prediction) are used to reduce I-cache latency
    - on an instruction cache hit, the next way and line prediction is loaded to read the next block (step 3)
    - an instruction miss leads to a check of the I-TLB and the prefetcher (steps 4 and 7), or an access to L2 if the instruction address is not found there (step 8)
    - L2 is direct mapped, 1-16 MB
56. Memory Hierarchy Design
- Putting It All Together: the Alpha 21264 Memory Hierarchy
  - The instruction prefetcher does not rely on the TLB for address translation; it simply increments the physical address of the miss by 64 bytes, checking to make sure that the new address is within the same page. Prefetching is suppressed if the new address is outside the page (step 14).
  - If the instruction is not found in L2, the physical address command is sent to the ES40 system chip set via four consecutive transfer cycles on a narrow, 15-bit outbound address bus (step 15). Address and command take 8 CPU cycles. The CPU is connected to memory via a crossbar to one of two 256-bit memory buses (step 16).
  - The total penalty of the instruction miss is about 130 CPU cycles for the critical instructions, while the rest of the block is filled at a rate of 8 bytes per 2 CPU cycles (step 17).
  - A victim file is associated with L2, a write-back cache, to store a replaced block (victim block) (step 18). The address of the victim is sent out on the system address bus following the address of the new request (step 19), where the system chip set later extracts the victim data and writes it to memory.
  - The D-cache is a 64-KB, two-way set-associative, write-back cache, which is virtually indexed but physically tagged. While the 9-bit index plus 3-bit word selection is sent to index the required data (step 24), the virtual page number is translated at the D-TLB (step 23), which is fully associative and has 128 PTEs, each of which can represent page sizes from 8 KB to 4 MB (step 25).
57. Memory Hierarchy Design
- Putting It All Together: the Alpha 21264 Memory Hierarchy
  - A TLB miss traps to PAL (privileged architecture library) code to load the valid PTE for this address. In the worst case a page fault happens, in which case the OS brings the page in from disk while the context is switched.
  - The index field of the address is sent to both sets of the data cache (step 26). Assuming a TLB hit, the two tags and valid bits are compared to the physical page number (steps 27-28), with a match sending the desired 8 bytes to the CPU (step 29).
  - A miss in the D-cache goes to L2, which proceeds similarly to an instruction miss (step 30), except that it must check the victim buffer to make sure the block is not there (step 31).
  - A write-back victim can be produced on a data cache miss. The victim data are extracted from the data cache simultaneously with the fill of the data cache with the L2 data, and sent to the victim buffer (step 32).
  - In case of an L2 miss, the fill data from the system are written directly into the (L1) data cache (step 33). The L2 is written only with L1 victims (step 34).
58. Memory Hierarchy Design
- Another View: the Emotion Engine of the Sony PlayStation 2
  - 3 Cs captured by the cache for SPEC2000 (left) and multimedia applications (right)
59. Memory Hierarchy Design
- Another View: the Emotion Engine of the Sony PlayStation 2