Title: Final Exam Review
1. Final Exam Review
2. Exam Format
- It will cover material after the mid-term (Cache to Multiprocessors)
- It is similar in style to the mid-term exam
- The exam will have 6-7 questions
  - One true/false or short-answer question covering general topics
  - 5-6 other questions requiring calculation
3. Memory Systems
4. Memory Hierarchy - the Big Picture
- Problem: memory is too slow and/or too small
- Solution: a memory hierarchy
(Diagram: memory hierarchy pyramid; levels range from the fastest, smallest, highest-cost per byte at the top to the slowest, biggest, lowest-cost per byte with the largest capacity at the bottom.)
5. Why Hierarchy Works
- The principle of locality
  - Programs access a relatively small portion of the address space at any instant of time.
- Temporal locality: recently accessed instructions/data are likely to be used again
- Spatial locality: instructions/data near recently accessed instructions/data are likely to be used soon
- Result: the illusion of a large, fast memory
6. Cache Design and Operation Issues
- Q1: Where can a block be placed in the cache? (Block placement strategy / cache organization)
  - Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the cache? (Block identification)
  - Tag/block
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, LRU
- Q4: What happens on a write? (Cache write policy)
  - Write through, write back
7. Q1: Block Placement
- Where can a block be placed in the cache? (See the placement sketch after this list.)
  - In one predetermined place: direct-mapped
    - Use a fragment of the address to calculate the block location in the cache
    - Compare the cache block's tag to test whether the block is present
  - Anywhere in the cache: fully associative
    - Compare the tag to every block in the cache
  - In a limited set of places: set-associative
    - Use an address fragment to calculate the set
    - Place in any block in the set
    - Compare the tag to every block in the set
    - Hybrid of direct mapped and fully associative
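The following is a minimal placement sketch in Python, assuming a byte-addressed cache with hypothetical parameters (32-byte blocks, 256 sets); the values are illustrative, not from the slides.

    BLOCK_SIZE = 32    # bytes per block (assumed)
    NUM_SETS   = 256   # number of sets (1 would mean fully associative)

    def placement(addr, num_sets=NUM_SETS, block_size=BLOCK_SIZE):
        # Split an address into (tag, set index).
        # Direct-mapped: one block per set, so set_index picks the unique slot.
        # Set-associative: the block may go in any way of set set_index.
        # Fully associative: num_sets == 1, so only the tag is used.
        block_number = addr // block_size
        set_index = block_number % num_sets
        tag = block_number // num_sets
        return tag, set_index

    print(placement(0x0001C0))  # arbitrary example address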
8. Q2: Block Identification
- Every cache block has an address tag and index that identify its location in memory
- Hit: when the tag and index of the desired word match (the comparison is done by hardware)
- Q: What happens when a cache block is empty? A: Mark this condition with a valid bit
Example cache entry: Valid = 1, Tag/index = 0x00001C0, Data = 0xff083c2d
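A minimal hit-check sketch in Python, assuming each cache line stores (valid, tag, data); the entry mirrors the example above.

    cache_line = {"valid": 1, "tag": 0x00001C0, "data": 0xff083c2d}

    def is_hit(line, requested_tag):
        # A hit requires both a set valid bit and a matching tag.
        return bool(line["valid"]) and line["tag"] == requested_tag

    print(is_hit(cache_line, 0x00001C0))  # True: valid and tag matches
    print(is_hit(cache_line, 0x00001C1))  # False: tag mismatch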
9. Cache Replacement Policy
- Random
  - Replace a randomly chosen line
- LRU (Least Recently Used)
  - Replace the least recently used line
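A minimal LRU sketch for one cache set in Python, assuming a hypothetical 4-way set and using an ordered dictionary as the recency list (illustrative only, not a required implementation).

    from collections import OrderedDict

    WAYS = 4                        # assumed associativity
    lru_set = OrderedDict()         # tag -> data, least recently used first

    def access(tag, data=None):
        if tag in lru_set:          # hit: mark the line most recently used
            lru_set.move_to_end(tag)
        else:                       # miss: evict the least recently used line
            if len(lru_set) >= WAYS:
                lru_set.popitem(last=False)
            lru_set[tag] = data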
10. Write-through Policy
(Diagram: a processor store of 0x5678 updates both the cache and main memory, which previously held 0x1234, so cache and memory always agree.)
11. Write-back Policy
(Diagram: a processor store of 0x5678 updates only the cache; main memory keeps the old value 0x1234 until the dirty block is written back.)
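A minimal sketch contrasting the two write policies in Python, modelling a single cached word as plain variables (illustrative only).

    cache_val, mem_val, dirty = 0x1234, 0x1234, False

    def write_through(value):
        global cache_val, mem_val
        cache_val = value           # update the cache ...
        mem_val = value             # ... and main memory on every write

    def write_back(value):
        global cache_val, dirty
        cache_val = value           # update only the cache
        dirty = True                # memory is updated later, on eviction

    def evict():
        global mem_val, dirty
        if dirty:                   # write-back: copy the dirty block to memory
            mem_val = cache_val
            dirty = False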
12. Cache Performance: Average Memory Access Time (AMAT), Memory Stall Cycles
- Average Memory Access Time (AMAT): the number of cycles required to complete an average memory access request by the CPU.
- Memory stall cycles per memory access: the number of stall cycles added to CPU execution cycles for one memory access.
- For an ideal memory, AMAT = 1 cycle, which results in zero memory stall cycles.
- Memory stall cycles per average memory access = AMAT - 1
- Memory stall cycles per average instruction
  = Memory stall cycles per average memory access x Number of memory accesses per instruction
  = (AMAT - 1) x (1 + fraction of loads/stores)
  (the 1 accounts for the instruction fetch)
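A worked sketch of the stall-cycle formulas in Python, with assumed values (AMAT = 2.5 cycles, 30% of instructions are loads/stores); the numbers are not from the slides.

    amat = 2.5                       # assumed average memory access time
    loads_stores_fraction = 0.30     # assumed fraction of loads/stores

    stalls_per_access = amat - 1                                     # 1.5 cycles
    stalls_per_instr  = stalls_per_access * (1 + loads_stores_fraction)
    print(stalls_per_instr)          # 1.95 stall cycles per instruction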
13. Cache Performance
- Unified cache: for a CPU with a single level (L1) of cache for both instructions and data and no stalls for cache hits
  - CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x Clock cycle time
  - CPUtime = IC x (CPIexecution + Memory accesses/instruction x Miss rate x Miss penalty) x Clock cycle time
- Split cache: for a CPU with separate (split) level-one (L1) caches for instructions and data and no stalls for cache hits
  - CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x Clock cycle time
  - Mem Stall cycles per instruction = Instruction Fetch Miss rate x Miss Penalty + Data Memory Accesses Per Instruction x Data Miss Rate x Miss Penalty
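A sketch of the unified-cache CPU time equation in Python, with assumed parameters (IC = 1,000,000, CPIexecution = 1.5, 1.3 memory accesses/instruction, 4% miss rate, 40-cycle miss penalty, 1 ns clock); all values are illustrative.

    IC                 = 1_000_000
    cpi_execution      = 1.5
    accesses_per_instr = 1.3
    miss_rate          = 0.04
    miss_penalty       = 40        # cycles
    clock_cycle        = 1e-9      # seconds

    stall_cycles_per_instr = accesses_per_instr * miss_rate * miss_penalty   # 2.08
    cpu_time = IC * (cpi_execution + stall_cycles_per_instr) * clock_cycle
    print(cpu_time)                # about 3.58e-3 seconds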
14. Memory Access Tree for a Unified Level 1 Cache
CPU Memory Access
- L1 Hit: Hit rate = H1, Access time = 1, Stalls = H1 x 0 = 0 (no stall)
- L1 Miss: Miss rate = (1 - H1), Access time = M + 1, Stall cycles per access = M x (1 - H1)
AMAT = H1 x 1 + (1 - H1) x (M + 1) = 1 + M x (1 - H1)
Stall Cycles Per Access = AMAT - 1 = M x (1 - H1)
where M = Miss Penalty, H1 = Level 1 Hit Rate, 1 - H1 = Level 1 Miss Rate
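A numeric check of the unified-L1 formula in Python, assuming H1 = 0.95 and a miss penalty M = 50 cycles (illustrative values).

    H1, M = 0.95, 50

    amat = 1 + M * (1 - H1)          # 1 + 50 x 0.05 = 3.5 cycles
    stalls_per_access = amat - 1     # 2.5 cycles
    print(amat, stalls_per_access)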
15. Memory Access Tree for Separate Level 1 Caches
CPU Memory Access (split into instruction and data accesses, each with its own L1 cache)
- Instruction L1 Hit: Access time = 1, Stalls = 0
- Instruction L1 Miss: Access time = M + 1, Stalls per access = % instructions x (1 - Instruction H1) x M
- Data L1 Hit: Access time = 1, Stalls = 0
- Data L1 Miss: Access time = M + 1, Stalls per access = % data x (1 - Data H1) x M
Stall Cycles Per Access = % instructions x (1 - Instruction H1) x M + % data x (1 - Data H1) x M
AMAT = 1 + Stall Cycles Per Access
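A numeric sketch of the split-L1 formula in Python, assuming 75% instruction accesses and 25% data accesses, Instruction H1 = 0.98, Data H1 = 0.94, and M = 50 cycles (illustrative values).

    frac_instr, frac_data = 0.75, 0.25
    H1_instr, H1_data, M  = 0.98, 0.94, 50

    stalls = frac_instr * (1 - H1_instr) * M + frac_data * (1 - H1_data) * M
    amat = 1 + stalls
    print(stalls, amat)   # 1.5 stall cycles per access, AMAT = 2.5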
16. Cache Performance (Various Factors)
- Cache impact on performance
  - With and without a cache
  - Processor clock rate
- Which performs better, a unified or a split cache?
  - Assuming the same total size
- What is the effect of cache organization on cache performance? 1-way vs. 8-way set associative
- Tradeoffs between hit time and hit rate
17. Cache Performance (Various Factors)
- What is the effect of write policy on cache performance? Write back vs. write through, write allocate vs. no-write allocate
  - Write through: Stall Cycles Per Memory Access = % reads x (1 - H1) x M + % writes x M
  - Write back: Stall Cycles Per Memory Access = (1 - H1) x (M x % clean + 2M x % dirty)
- What is the effect of additional cache levels on performance? (A worked two-level example follows this list.)
  - Two levels: Stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x M
  - Three levels: Stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x H3 x T3 + (1 - H1) x (1 - H2) x (1 - H3) x M
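A worked sketch of the two-level formula in Python, assuming H1 = 0.95, H2 = 0.90 (hit rate of L2 for accesses that miss in L1), T2 = 10 cycles, and M = 100 cycles (illustrative values).

    H1, H2, T2, M = 0.95, 0.90, 10, 100

    stalls = (1 - H1) * H2 * T2 + (1 - H1) * (1 - H2) * M
    print(stalls)   # 0.45 + 0.5 = 0.95 stall cycles per memory access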
18. Performance Equation
To reduce CPUtime, we need to reduce the cache miss rate.
19. Reducing Misses (3 Cs)
- Classifying cache misses: the 3 Cs
  - Compulsory (misses that occur even in an infinite-size cache)
  - Capacity (misses due to the size of the cache)
  - Conflict (misses due to the associativity and size of the cache)
- How to reduce the 3 Cs (miss rate)
  - Increase block size
  - Increase associativity
  - Use a victim cache
  - Use a pseudo-associative cache
  - Use a prefetching technique
20. Performance Equation
To reduce CPUtime, we need to reduce the cache miss penalty.
21. Memory Interleaving: Reduce Miss Penalty
- Default: must finish accessing one word before starting the next access
  - Fetching a 4-word block takes (1 + 25 + 1) x 4 = 108 cycles
- Interleaving: begin accessing one word and, while waiting, start accessing the other three words (pipelining)
  - The same block takes 1 + 25 + 4 x 1 = 30 cycles
  - Requires 4 separate memories, each 1/4 the size
  - Spread out addresses among the memories
- Interleaving works perfectly with caches
22. Memory Interleaving: An Example
- Given the following system parameters with a single cache level (L1)
  - Block size = 1 word, memory bus width = 1 word, miss rate = 3%, miss penalty = 27 cycles
  - (1 cycle to send the address, 25 cycles access time per word, 1 cycle to send a word)
  - Memory accesses/instruction = 1.2, ideal CPI (ignoring cache misses) = 2
  - Miss rate (block size = 2 words) = 2%, miss rate (block size = 4 words) = 1%
- The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 27) = 2.97
- Increasing the block size to two words gives the following CPI:
  - 32-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x 2 x 27) = 3.29
  - 32-bit bus and memory, interleaved: 2 + (1.2 x 0.02 x 28) = 2.67
- Increasing the block size to four words gives the following CPI:
  - 32-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x 4 x 27) = 3.29
  - 32-bit bus and memory, interleaved: 2 + (1.2 x 0.01 x 30) = 2.36
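A small Python sketch that reproduces the CPI figures above, using the slide's parameters (ideal CPI = 2, 1.2 memory accesses per instruction).

    ideal_cpi, accesses = 2, 1.2

    def cpi(miss_rate, miss_penalty):
        return ideal_cpi + accesses * miss_rate * miss_penalty

    print(cpi(0.03, 27))        # 1-word blocks:                2.97
    print(cpi(0.02, 2 * 27))    # 2-word blocks, no interleave: about 3.3
    print(cpi(0.02, 28))        # 2-word blocks, interleaved:   2.67
    print(cpi(0.01, 4 * 27))    # 4-word blocks, no interleave: about 3.3
    print(cpi(0.01, 30))        # 4-word blocks, interleaved:   2.36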
23. Cache vs. Virtual Memory
- Motivation for virtual memory (physical memory size, multiprogramming)
- The concept behind VM is almost identical to the concept behind caches
  - But different terminology!
  - Cache Block = VM Page
  - Cache Miss = VM Page Fault
- Caches are implemented completely in hardware; VM is implemented in software, with hardware support from the CPU
- Cache speeds up main memory access, while main memory speeds up VM access
- Translation Look-aside Buffer (TLB)
- How to calculate the size of the page table for a given memory system (a worked sketch follows this list)
- How to calculate the page size given the size of the page table
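A worked page-table-size sketch in Python, assuming a 32-bit virtual address space, 4 KB pages, and 4-byte page table entries (assumed values, not from the slides).

    virtual_address_bits = 32
    page_size            = 4 * 1024       # bytes
    pte_size             = 4              # bytes per page table entry

    num_pages = 2**virtual_address_bits // page_size    # 2^20 = 1,048,576 pages
    page_table_bytes = num_pages * pte_size             # 4 MB per process
    print(num_pages, page_table_bytes)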
24. Virtual Memory Definitions
- Key idea: simulate a larger physical memory than is actually available
- General approach
  - Break the address space up into pages
  - Each program accesses a working set of pages
  - Store pages
    - In physical memory as space permits
    - On disk when no space is left in physical memory
  - Access pages using virtual addresses
(Diagram: a memory map translates individual pages of virtual memory to physical memory or to disk.)
25. I/O Systems
26. I/O Systems
27. I/O Concepts
- Disk performance (a worked latency example follows this list)
  - Disk latency = average seek time + average rotational delay + transfer time + controller overhead
- Interrupt-driven I/O
- Memory-mapped I/O
- I/O channels
- DMA (Direct Memory Access)
- I/O communication protocols
  - Daisy chaining
  - Polling
- I/O buses
  - Synchronous vs. asynchronous
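A worked disk-latency sketch in Python, with assumed values: 5 ms average seek, a 7200 RPM disk, a 4 KB transfer at 100 MB/s, and 0.2 ms controller overhead (all illustrative).

    seek_ms       = 5.0
    rotation_ms   = 0.5 * (60_000 / 7200)       # half a revolution, about 4.17 ms
    transfer_ms   = (4 / 1024) / 100 * 1000     # 4 KB at 100 MB/s, about 0.04 ms
    controller_ms = 0.2

    latency_ms = seek_ms + rotation_ms + transfer_ms + controller_ms
    print(latency_ms)    # about 9.4 ms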
28. RAID Systems
- Examined various RAID architectures (RAID 0 to RAID 5): cost, performance (bandwidth, I/O request rate)
  - RAID 0: no redundancy
  - RAID 1: mirroring
  - RAID 2: memory-style ECC
  - RAID 3: bit-interleaved parity
  - RAID 4: block-interleaved parity
  - RAID 5: block-interleaved distributed parity
29. Storage Architectures
- Examined various storage architectures (pros and cons)
  - DAS: Directly-Attached Storage
  - NAS: Network-Attached Storage
  - SAN: Storage Area Network
30. Multiprocessors
31. Motivation
- Application needs
- Amdahl's law
  - T(n) = 1 / (s + p/n)
  - As n → ∞, T(n) → 1/s
- Gustafson's law
  - T'(n) = s + np, so T'(∞) → ∞ !
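A numeric sketch of the two laws in Python, assuming a serial fraction s = 0.05 (so p = 0.95) and n = 100 processors (illustrative values).

    s, p, n = 0.05, 0.95, 100

    amdahl_speedup    = 1 / (s + p / n)   # about 16.8, bounded above by 1/s = 20
    gustafson_speedup = s + n * p         # 95.05, grows without bound with n
    print(amdahl_speedup, gustafson_speedup)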
32. Flynn's Taxonomy of Computing
- SISD (Single Instruction, Single Data)
  - Typical uniprocessor systems that we've studied throughout this course
- SIMD (Single Instruction, Multiple Data)
  - Multiple processors simultaneously executing the same instruction on different data
  - Specialized applications (e.g., image processing)
- MIMD (Multiple Instruction, Multiple Data)
  - Multiple processors autonomously executing different instructions on different data
33. Shared Memory Multiprocessors
(Diagram: multiple processors connected to a shared memory.)
34. MPP (Massively Parallel Processing): Distributed Memory Multiprocessors
(Diagram: each node contains a processor/cache (P/C), local memory (LM), a memory bus (MB), and network interface circuitry (NIC); nodes are connected by a custom-designed network.)
35. Cluster
(Diagram: each node contains a processor/cache (P/C), memory (M), a memory bus (MB), and a bridge to an I/O bus (IOB) with a local disk (LD) and NIC; nodes are connected by a commodity network such as Ethernet, ATM, or Myrinet.)
36. Grid
(Diagram: sites containing P/C, IOC, Hub/LAN, NIC, LD, and SM components, connected through the Internet.)
37. Multiprocessor Concepts
- SIMD applications (image processing)
- MIMD
  - Shared memory
    - Cache coherence problems
    - Bus scalability problems
  - Distributed memory
    - Interconnection networks
    - Clusters of workstations
38. Preparation Strategy
- Read this review to focus your preparation
  - 1 general question
  - 5-6 other questions
  - Around 50% on memory systems
  - Around 50% on I/O and multiprocessors
- Go through the lecture notes
- Go through the training problems
- We will have more office hours for help
- Good luck!