Title: CENG 450 Computer Systems and Architecture Lecture 15
Slide 1: CENG 450 Computer Systems and Architecture, Lecture 15
- Amirali Baniasadi
- amirali_at_ece.uvic.ca
Slide 2: Announcements
- Last quiz scheduled for March 31st.
Slide 3: Cache Write Policy: Write Through versus Write Back
- Cache reads are much easier to handle than cache writes.
- An instruction cache is much easier to design than a data cache.
- Cache writes: how do we keep data in the cache and memory consistent?
- Two options (sketched in code below):
  - Write Back: write to the cache only, and write the cache block to memory when that block is replaced on a cache miss.
    - Needs a dirty bit for each cache block.
    - Greatly reduces the memory bandwidth requirement.
    - Control can be complex.
  - Write Through: write to the cache and memory at the same time.
    - Isn't memory too slow for this?
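A minimal sketch in C of the two policies (the struct and function names are hypothetical, not from the lecture; only the dirty-bit bookkeeping matters here):

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_WORDS 8

    struct cache_block {
        uint32_t tag;
        bool     valid;
        bool     dirty;                 /* needed only for write-back */
        uint32_t data[BLOCK_WORDS];
    };

    /* Write-through: update the block and memory on every store. */
    void store_write_through(struct cache_block *b, int word, uint32_t v,
                             uint32_t *memory_word)
    {
        b->data[word] = v;
        *memory_word = v;               /* memory is always consistent */
    }

    /* Write-back: update only the block and mark it dirty; memory is
     * updated later, when the dirty block is evicted on a miss. */
    void store_write_back(struct cache_block *b, int word, uint32_t v)
    {
        b->data[word] = v;
        b->dirty = true;
    }

    void evict_write_back(struct cache_block *b, uint32_t *memory_block)
    {
        if (b->valid && b->dirty) {     /* write the whole block once */
            for (int i = 0; i < BLOCK_WORDS; i++)
                memory_block[i] = b->data[i];
            b->dirty = false;
        }
    }

The write-back path touches memory once per evicted block rather than once per store, which is where the bandwidth saving comes from.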
Slide 4: Write Buffer for Write Through
[Figure: processor writes go to the cache and to a write buffer; the write buffer drains to DRAM.]
- A write buffer is needed between the cache and memory.
- The processor writes data into the cache and the write buffer.
- The memory controller writes the contents of the buffer to memory.
- The write buffer is just a FIFO (sketched below):
  - Typical number of entries: 4
  - Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle.
- Memory system designer's nightmare:
  - Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  - Write buffer saturation
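A rough software sketch of such a FIFO write buffer (names and the addressing convention are assumed for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4                /* typical depth from the slide */

    struct write_buffer {
        struct { uint32_t addr, data; } entry[WB_ENTRIES];
        int head, tail, count;          /* simple FIFO bookkeeping */
    };

    /* Processor side: a store goes into the buffer; the CPU stalls
     * only if the buffer is full. Returns false on a stall. */
    bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data)
    {
        if (wb->count == WB_ENTRIES)
            return false;               /* buffer saturated: CPU must wait */
        wb->entry[wb->tail].addr = addr;
        wb->entry[wb->tail].data = data;
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;
    }

    /* Memory-controller side: drain one entry per DRAM write cycle. */
    bool wb_drain_one(struct write_buffer *wb, uint32_t *dram_base)
    {
        if (wb->count == 0)
            return false;
        dram_base[wb->entry[wb->head].addr / 4] = wb->entry[wb->head].data;
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }

If wb_push fails faster than wb_drain_one succeeds for long stretches, no buffer depth helps; that is exactly the saturation condition of the next slide.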
Slide 5: Write Buffer Saturation
[Figure: processor, cache, write buffer, and DRAM, as on the previous slide.]
- Store frequency (w.r.t. time) -> 1 / DRAM write cycle:
  - If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it.
  - The CPU cycle time < DRAM write cycle time.
- Solutions for write buffer saturation:
  - Use a write-back cache.
  - Install a second-level (L2) cache.
[Figure: processor with an L1 cache and write buffer, an L2 cache behind them, then DRAM.]
Slide 6: Improving Cache Performance
- Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.
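For concreteness, a quick worked example with assumed numbers (illustrative, not from the lecture): with a 1-cycle hit time, a 5% miss rate, and a 50-cycle miss penalty,

    AMAT = 1 + 0.05 × 50 = 3.5 cycles

Each of the three levers above attacks a different term of this sum.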
Slide 7: 1. Reduce Misses via Larger Block Size
Slide 8: 2. Reduce Misses via Higher Associativity
- 2:1 Cache Rule:
  - Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2.
- Beware: execution time is the only final measure!
  - Will the clock cycle time increase?
  - Hill [1988] suggested the hit time for 2-way vs. 1-way is about 10% longer for an external cache, 2% for an internal cache.
Slide 9: 3. Reducing Misses via a Victim Cache
- How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a small buffer (the victim cache) to hold data discarded from the cache (see the sketch below).
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache.
- Used in Alpha, HP machines.
[Figure: victim cache organization. The main cache's TAGS and DATA arrays sit alongside four fully associative victim entries, each a tag-and-comparator plus one cache line of data, connected to the next lower level in the hierarchy.]
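A minimal sketch of the lookup path (all names and sizes hypothetical): on a main-cache miss the four fully associative victim entries are probed, and a victim hit swaps the two blocks, turning a conflict miss into a slow hit.

    #include <stdbool.h>
    #include <stdint.h>

    #define VICTIM_ENTRIES 4
    #define SETS 1024

    struct line { bool valid; uint32_t tag; /* plus one line of data */ };

    static struct line main_cache[SETS];         /* direct mapped */
    static struct line victim[VICTIM_ENTRIES];   /* fully associative */

    bool lookup(uint32_t addr)
    {
        uint32_t tag   = addr >> 5;              /* full block address, 32-byte blocks */
        uint32_t index = tag & (SETS - 1);

        if (main_cache[index].valid && main_cache[index].tag == tag)
            return true;                         /* fast direct-mapped hit */

        for (int i = 0; i < VICTIM_ENTRIES; i++) {
            if (victim[i].valid && victim[i].tag == tag) {
                /* Victim hit: swap the blocks so the hot one is back
                 * in the main cache. */
                struct line tmp = main_cache[index];
                main_cache[index] = victim[i];
                victim[i] = tmp;
                return true;
            }
        }
        return false;                            /* real miss: go to next lower level */
    }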
Slide 10: 4. Reducing Misses via Pseudo-Associativity
- How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache?
- Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, it is a "pseudo-hit" (slow hit). A sketch follows below.
- Drawback: it is difficult to build a CPU pipeline if a hit may take either 1 or 2 cycles.
  - Better for caches not tied directly to the processor (L2).
  - Used in the MIPS R10000 L2 cache; similar in the UltraSPARC.
[Figure: access timeline showing Hit Time, Pseudo Hit Time, and Miss Penalty.]
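A sketch of the pseudo-associative probe sequence (structure and names assumed for illustration): the second probe simply flips the most-significant index bit to reach the "other half" of the cache.

    #include <stdbool.h>
    #include <stdint.h>

    #define INDEX_BITS 10
    #define NSETS (1 << INDEX_BITS)

    static uint32_t tags[NSETS];
    static bool     valid[NSETS];

    enum probe_result { FAST_HIT, PSEUDO_HIT, MISS };

    enum probe_result pseudo_assoc_lookup(uint32_t addr)
    {
        uint32_t tag   = addr >> 5;              /* full block address as tag */
        uint32_t index = tag & (NSETS - 1);

        if (valid[index] && tags[index] == tag)
            return FAST_HIT;                     /* normal 1-cycle hit */

        /* On a miss, check the other half: flip the index MSB. */
        uint32_t alt = index ^ (1u << (INDEX_BITS - 1));
        if (valid[alt] && tags[alt] == tag)
            return PSEUDO_HIT;                   /* slow hit, e.g. 2 cycles */

        return MISS;
    }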
Slide 11: 5. Reducing Misses by Compiler Optimizations
- Instructions
  - Not discussed here.
- Data
  - Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays.
  - Loop Interchange: change the nesting of loops to access data in the order it is stored in memory.
  - Loop Fusion: combine 2 independent loops that have the same looping and some variables in common.
Slide 12: Merging Arrays Example

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

- Reduces conflicts between val and key; improves spatial locality.
Slide 13: Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k + 1)
        for (j = 0; j < 100; j = j + 1)
            for (i = 0; i < 5000; i = i + 1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k + 1)
        for (i = 0; i < 5000; i = i + 1)
            for (j = 0; j < 100; j = j + 1)
                x[i][j] = 2 * x[i][j];

- Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Slide 14: Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i + 1)
        for (j = 0; j < N; j = j + 1)
            a[i][j] = 1 / b[i][j] * c[i][j];
    for (i = 0; i < N; i = i + 1)
        for (j = 0; j < N; j = j + 1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i + 1)
        for (j = 0; j < N; j = j + 1) {
            a[i][j] = 1 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

- Before fusion: 2 misses per access to a and c; after: one miss per access, since a[i][j] and c[i][j] are reused while still in the cache. Improves temporal locality.
Slide 15: Summary of Compiler Optimizations (by hand)
Slide 16: Summary: Miss Rate Reduction
- 3 Cs: Compulsory, Capacity, Conflict
- 1. Reduce Misses via Larger Block Size
- 2. Reduce Misses via Higher Associativity
- 3. Reducing Misses via Victim Cache
- 4. Reducing Misses via Pseudo-Associativity
- 5. Reducing Misses by Compiler Optimizations
Slide 17: Improving Cache Performance
- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.
Slide 18: 1. Reduce Miss Penalty: Early Restart and Critical Word First
- Don't wait for the full block to be loaded before restarting the CPU.
  - Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
  - Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first. A refill sketch follows below.
- Generally useful only when the cache block is larger than the bus width.
- Spatial locality is a problem: the CPU tends to want the next sequential word anyway, so it is not clear how much early restart helps.
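A sketch of a combined early-restart, critical-word-first block fill (the refill loop and callback are assumed for illustration, not from the lecture):

    #include <stdint.h>

    #define WORDS_PER_BLOCK 8

    void fill_block_cwf(uint32_t block[WORDS_PER_BLOCK],
                        const uint32_t memory_block[WORDS_PER_BLOCK],
                        int requested,                 /* word the CPU missed on */
                        void (*forward_to_cpu)(uint32_t))
    {
        for (int i = 0; i < WORDS_PER_BLOCK; i++) {
            int w = (requested + i) % WORDS_PER_BLOCK; /* wrapped fetch order */
            block[w] = memory_block[w];
            if (i == 0)
                forward_to_cpu(block[w]);   /* early restart: CPU resumes now */
        }
    }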
Slide 19: 2. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses
- A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss.
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses (sketched below).
- The Pentium Pro allows 4 outstanding memory misses.
Slide 20: 3. Use a Multi-level Cache
- L2 equations:
  - AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  - Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  - AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
- Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2).
  - The global miss rate is what matters (worked example below).
Slide 21: Reducing Misses: Which Apply to L2 Cache?
- Reducing miss rate:
  - 1. Reduce misses via larger block size
  - 2. Reduce conflict misses via higher associativity
  - 3. Reduce conflict misses via a victim cache
  - 4. Reduce conflict misses via pseudo-associativity
  - 5. Reduce capacity/conflict misses by compiler optimizations
Slide 22: L2 Cache Block Size and A.M.A.T.
- 32 KB L1; 8-byte path to memory
Slide 23: Reducing Miss Penalty: Summary
- Three techniques:
  - Early restart and critical word first on a miss
  - Non-blocking caches (hit under miss, miss under miss)
  - Second-level cache
- Can be applied recursively to multilevel caches.
- The danger is that the time to DRAM will grow with multiple levels in between.
Slide 24: Summary: The Cache Design Space
- Several interacting dimensions
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back
- The optimal choice is a compromise
- depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
- depends on technology / cost
- Simplicity often wins
[Figure: design-space sketch plotting "good" vs. "bad" as a factor goes from less to more, for cache size, associativity, and block size (curves labeled Factor A and Factor B).]
Slide 25: IBM POWER4 Memory Hierarchy
- L1 (Instr.): 64 KB, direct mapped. L1 (Data): 32 KB, 2-way, FIFO replacement. 128-byte blocks divided into 32-byte sectors; 4 cycles to load to a floating-point register.
- L2 (Instr. + Data): 1440 KB, 3-way, pseudo-LRU (shared by two processors); write allocate; 128-byte blocks; 14 cycles to load to a floating-point register.
- L3 (Instr. + Data): 128 MB, 8-way (shared by two processors); 512-byte blocks divided into 128-byte sectors; about 340 cycles.
Slide 26: Intel Itanium Processor
- L1 (Instr.): 16 KB, 4-way. L1 (Data): 16 KB, 4-way, dual-ported, write through. 32-byte blocks; 2 cycles.
- L2 (Instr. + Data): 96 KB, 6-way; 64-byte blocks; write allocate; 12 cycles.
- L3: 4 MB (on package, off chip); 64-byte blocks; 128-bit bus at 800 MHz (12.8 GB/s); 20 cycles.
Slide 27: 3rd Generation Itanium
- 1.5 GHz
- 410 million transistors
- 6 MB, 24-way set-associative L3 cache
- 6-level copper interconnect, 0.13 micron process
- 130 W (i.e., lasts 17 s on an AA NiCd)
Slide 28: Cache Performance
Slide 29: Impact on Performance
- Suppose a processor executes at:
  - Clock rate = 1 GHz (1 ns per cycle), ideal (no misses) CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations incur a 100-cycle miss penalty.
- Suppose that 1% of instructions incur the same miss penalty.
- Then CPI = 1.1 + 0.30 × 0.10 × 100 + 0.01 × 100 = 1.1 + 3.0 + 1.0 = 5.1, so stalls account for 4.0 / 5.1 ≈ 78% of the time: the processor is stalled waiting for memory!
Slide 30: Example: Harvard Architecture
- Unified vs. separate I and D caches (Harvard):
  - 16 KB I + 16 KB D: instruction miss rate = 0.64%, data miss rate = 6.47%
  - 32 KB unified: aggregate miss rate = 1.99%
- Which is better (ignoring the L2 cache)?
  - Assume 33% data ops, so 75% of accesses are instruction fetches (1.0 / 1.33).
  - Hit time = 1, miss time = 50.
  - Note that a data hit incurs 1 extra stall in the unified cache (it has only one port).
  - AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
  - AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
Slide 31: Summary
- The principle of locality:
  - Programs access a relatively small portion of the address space at any instant of time.
  - Temporal locality: locality in time.
  - Spatial locality: locality in space.
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life. Example: cold-start misses.
  - Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
  - Capacity misses: increase cache size.
- Write policy:
  - Write through: needs a write buffer. Nightmare: write buffer saturation.
  - Write back: control can be complex.
- Cache performance