Title: Final Exam Review
1. Final Exam Review
2. Exam Format
- It will cover material after the mid-term (Cache to Multiprocessors)
- It is similar in style to the mid-term exam
- The exam will have 6-7 questions
  - One true/false or short-answer question covering general topics
  - 5-6 other questions requiring calculation
3. Memory Systems
4. Memory Hierarchy - the Big Picture
- Problem: memory is too slow and/or too small
- Solution: a memory hierarchy
(Diagram: memory hierarchy pyramid; levels range from the fastest, smallest, highest-cost per byte at the top to the slowest, biggest, lowest-cost per byte with the largest capacity at the bottom.)
5. Why Hierarchy Works
- The principle of locality
  - Programs access a relatively small portion of the address space at any instant of time.
- Temporal locality: recently accessed instructions/data are likely to be used again
- Spatial locality: instructions/data near recently accessed instructions/data are likely to be used soon
- Result: the illusion of a large, fast memory
6. Cache Design and Operation Issues
- Q1: Where can a block be placed in the cache? (Block placement strategy / cache organization)
  - Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the cache? (Block identification)
  - Tag/block
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random, LRU
- Q4: What happens on a write? (Cache write policy)
  - Write through, write back
7. Q1: Block Placement
- Where can a block be placed in the cache? (See the placement sketch after this list.)
  - In one predetermined place: direct-mapped
    - Use a fragment of the address to calculate the block location in the cache
    - Compare the cache block's tag to test whether the block is present
  - Anywhere in the cache: fully associative
    - Compare the tag to every block in the cache
  - In a limited set of places: set-associative
    - Use an address fragment to calculate the set
    - Place in any block in the set
    - Compare the tag to every block in the set
    - Hybrid of direct mapped and fully associative
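The following is a minimal placement sketch in Python, assuming a byte-addressed cache with hypothetical parameters (32-byte blocks, 256 sets); the values are illustrative, not from the slides.

    BLOCK_SIZE = 32    # bytes per block (assumed)
    NUM_SETS   = 256   # number of sets (1 would mean fully associative)

    def placement(addr, num_sets=NUM_SETS, block_size=BLOCK_SIZE):
        # Split an address into (tag, set index).
        # Direct-mapped: one block per set, so set_index picks the unique slot.
        # Set-associative: the block may go in any way of set set_index.
        # Fully associative: num_sets == 1, so only the tag is used.
        block_number = addr // block_size
        set_index = block_number % num_sets
        tag = block_number // num_sets
        return tag, set_index

    print(placement(0x0001C0))  # arbitrary example address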
8. Q2: Block Identification
- Every cache block has an address tag and index that identify its location in memory
- Hit: when the tag and index of the desired word match (the comparison is done by hardware)
- Q: What happens when a cache block is empty? A: Mark this condition with a valid bit
Example cache entry: Valid = 1, Tag/index = 0x00001C0, Data = 0xff083c2d
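A minimal hit-check sketch in Python, assuming each cache line stores (valid, tag, data); the entry mirrors the example above.

    cache_line = {"valid": 1, "tag": 0x00001C0, "data": 0xff083c2d}

    def is_hit(line, requested_tag):
        # A hit requires both a set valid bit and a matching tag.
        return bool(line["valid"]) and line["tag"] == requested_tag

    print(is_hit(cache_line, 0x00001C0))  # True: valid and tag matches
    print(is_hit(cache_line, 0x00001C1))  # False: tag mismatch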
9. Cache Replacement Policy
- Random
  - Replace a randomly chosen line
- LRU (Least Recently Used)
  - Replace the least recently used line
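A minimal LRU sketch for one cache set in Python, assuming a hypothetical 4-way set and using an ordered dictionary as the recency list (illustrative only, not a required implementation).

    from collections import OrderedDict

    WAYS = 4                        # assumed associativity
    lru_set = OrderedDict()         # tag -> data, least recently used first

    def access(tag, data=None):
        if tag in lru_set:          # hit: mark the line most recently used
            lru_set.move_to_end(tag)
        else:                       # miss: evict the least recently used line
            if len(lru_set) >= WAYS:
                lru_set.popitem(last=False)
            lru_set[tag] = data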
10. Write-through Policy
(Diagram: a processor store of 0x5678 updates both the cache and main memory, which previously held 0x1234, so cache and memory always agree.)
11. Write-back Policy
(Diagram: a processor store of 0x5678 updates only the cache; main memory keeps the old value 0x1234 until the dirty block is written back.)
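A minimal sketch contrasting the two write policies in Python, modelling a single cached word as plain variables (illustrative only).

    cache_val, mem_val, dirty = 0x1234, 0x1234, False

    def write_through(value):
        global cache_val, mem_val
        cache_val = value           # update the cache ...
        mem_val = value             # ... and main memory on every write

    def write_back(value):
        global cache_val, dirty
        cache_val = value           # update only the cache
        dirty = True                # memory is updated later, on eviction

    def evict():
        global mem_val, dirty
        if dirty:                   # write-back: copy the dirty block to memory
            mem_val = cache_val
            dirty = False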
12. Cache Performance: Average Memory Access Time (AMAT), Memory Stall Cycles
- Average Memory Access Time (AMAT): the number of cycles required to complete an average memory access request by the CPU.
- Memory stall cycles per memory access: the number of stall cycles added to CPU execution cycles for one memory access.
- For an ideal memory, AMAT = 1 cycle, which results in zero memory stall cycles.
- Memory stall cycles per average memory access = AMAT - 1
- Memory stall cycles per average instruction
  = Memory stall cycles per average memory access x Number of memory accesses per instruction
  = (AMAT - 1) x (1 + fraction of loads/stores)
  (the 1 accounts for the instruction fetch)
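A worked sketch of the stall-cycle formulas in Python, with assumed values (AMAT = 2.5 cycles, 30% of instructions are loads/stores); the numbers are not from the slides.

    amat = 2.5                       # assumed average memory access time
    loads_stores_fraction = 0.30     # assumed fraction of loads/stores

    stalls_per_access = amat - 1                                     # 1.5 cycles
    stalls_per_instr  = stalls_per_access * (1 + loads_stores_fraction)
    print(stalls_per_instr)          # 1.95 stall cycles per instruction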
13. Cache Performance
- Unified cache: for a CPU with a single level (L1) of cache for both instructions and data and no stalls for cache hits
  - CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x Clock cycle time
  - CPUtime = IC x (CPIexecution + Memory accesses/instruction x Miss rate x Miss penalty) x Clock cycle time
- Split cache: for a CPU with separate (split) level-one (L1) caches for instructions and data and no stalls for cache hits
  - CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x Clock cycle time
  - Mem Stall cycles per instruction = Instruction Fetch Miss rate x Miss Penalty + Data Memory Accesses Per Instruction x Data Miss Rate x Miss Penalty
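A sketch of the unified-cache CPU time equation in Python, with assumed parameters (IC = 1,000,000, CPIexecution = 1.5, 1.3 memory accesses/instruction, 4% miss rate, 40-cycle miss penalty, 1 ns clock); all values are illustrative.

    IC                 = 1_000_000
    cpi_execution      = 1.5
    accesses_per_instr = 1.3
    miss_rate          = 0.04
    miss_penalty       = 40        # cycles
    clock_cycle        = 1e-9      # seconds

    stall_cycles_per_instr = accesses_per_instr * miss_rate * miss_penalty   # 2.08
    cpu_time = IC * (cpi_execution + stall_cycles_per_instr) * clock_cycle
    print(cpu_time)                # about 3.58e-3 seconds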
14. Memory Access Tree for a Unified Level 1 Cache
CPU Memory Access
- L1 Hit: Hit rate = H1, Access time = 1, Stalls = H1 x 0 = 0 (no stall)
- L1 Miss: Miss rate = (1 - H1), Access time = M + 1, Stall cycles per access = M x (1 - H1)
AMAT = H1 x 1 + (1 - H1) x (M + 1) = 1 + M x (1 - H1)
Stall Cycles Per Access = AMAT - 1 = M x (1 - H1)
where M = Miss Penalty, H1 = Level 1 Hit Rate, 1 - H1 = Level 1 Miss Rate
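A numeric check of the unified-L1 formula in Python, assuming H1 = 0.95 and a miss penalty M = 50 cycles (illustrative values).

    H1, M = 0.95, 50

    amat = 1 + M * (1 - H1)          # 1 + 50 x 0.05 = 3.5 cycles
    stalls_per_access = amat - 1     # 2.5 cycles
    print(amat, stalls_per_access)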
15. Memory Access Tree for Separate Level 1 Caches
CPU Memory Access (split into instruction and data accesses, each with its own L1 cache)
- Instruction L1 Hit: Access time = 1, Stalls = 0
- Instruction L1 Miss: Access time = M + 1, Stalls per access = % instructions x (1 - Instruction H1) x M
- Data L1 Hit: Access time = 1, Stalls = 0
- Data L1 Miss: Access time = M + 1, Stalls per access = % data x (1 - Data H1) x M
Stall Cycles Per Access = % instructions x (1 - Instruction H1) x M + % data x (1 - Data H1) x M
AMAT = 1 + Stall Cycles Per Access
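A numeric sketch of the split-L1 formula in Python, assuming 75% instruction accesses and 25% data accesses, Instruction H1 = 0.98, Data H1 = 0.94, and M = 50 cycles (illustrative values).

    frac_instr, frac_data = 0.75, 0.25
    H1_instr, H1_data, M  = 0.98, 0.94, 50

    stalls = frac_instr * (1 - H1_instr) * M + frac_data * (1 - H1_data) * M
    amat = 1 + stalls
    print(stalls, amat)   # 1.5 stall cycles per access, AMAT = 2.5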
16. Cache Performance (Various Factors)
- Cache impact on performance
  - With and without a cache
  - Processor clock rate
- Which performs better, a unified or a split cache?
  - Assuming the same total size
- What is the effect of cache organization on cache performance? 1-way vs. 8-way set associative
- Tradeoffs between hit time and hit rate
17. Cache Performance (Various Factors)
- What is the effect of write policy on cache performance? Write back vs. write through, write allocate vs. no-write allocate
  - Write through: Stall Cycles Per Memory Access = % reads x (1 - H1) x M + % writes x M
  - Write back: Stall Cycles Per Memory Access = (1 - H1) x (M x % clean + 2M x % dirty)
- What is the effect of additional cache levels on performance? (A worked two-level example follows this list.)
  - Two levels: Stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x M
  - Three levels: Stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x H3 x T3 + (1 - H1) x (1 - H2) x (1 - H3) x M
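A worked sketch of the two-level formula in Python, assuming H1 = 0.95, H2 = 0.90 (hit rate of L2 for accesses that miss in L1), T2 = 10 cycles, and M = 100 cycles (illustrative values).

    H1, H2, T2, M = 0.95, 0.90, 10, 100

    stalls = (1 - H1) * H2 * T2 + (1 - H1) * (1 - H2) * M
    print(stalls)   # 0.45 + 0.5 = 0.95 stall cycles per memory access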
18. Performance Equation
To reduce CPUtime, we need to reduce the cache miss rate.
19. Reducing Misses (3 Cs)
- Classifying cache misses: the 3 Cs
  - Compulsory (misses that occur even in an infinite-size cache)
  - Capacity (misses due to the size of the cache)
  - Conflict (misses due to the associativity and size of the cache)
- How to reduce the 3 Cs (miss rate)
  - Increase block size
  - Increase associativity
  - Use a victim cache
  - Use a pseudo-associative cache
  - Use a prefetching technique
20. Performance Equation
To reduce CPUtime, we need to reduce the cache miss penalty.
21. Memory Interleaving: Reduce Miss Penalty
- Default: must finish accessing one word before starting the next access
  - Fetching a 4-word block takes (1 + 25 + 1) x 4 = 108 cycles
- Interleaving: begin accessing one word and, while waiting, start accessing the other three words (pipelining)
  - The same block takes 1 + 25 + 4 x 1 = 30 cycles
  - Requires 4 separate memories, each 1/4 the size
  - Spread out addresses among the memories
- Interleaving works perfectly with caches
22. Memory Interleaving: An Example
- Given the following system parameters with a single cache level (L1)
  - Block size = 1 word, memory bus width = 1 word, miss rate = 3%, miss penalty = 27 cycles
  - (1 cycle to send the address, 25 cycles access time per word, 1 cycle to send a word)
  - Memory accesses/instruction = 1.2, ideal CPI (ignoring cache misses) = 2
  - Miss rate (block size = 2 words) = 2%, miss rate (block size = 4 words) = 1%
- The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 27) = 2.97
- Increasing the block size to two words gives the following CPI:
  - 32-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x 2 x 27) = 3.29
  - 32-bit bus and memory, interleaved: 2 + (1.2 x 0.02 x 28) = 2.67
- Increasing the block size to four words gives the following CPI:
  - 32-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x 4 x 27) = 3.29
  - 32-bit bus and memory, interleaved: 2 + (1.2 x 0.01 x 30) = 2.36
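A small Python sketch that reproduces the CPI figures above, using the slide's parameters (ideal CPI = 2, 1.2 memory accesses per instruction).

    ideal_cpi, accesses = 2, 1.2

    def cpi(miss_rate, miss_penalty):
        return ideal_cpi + accesses * miss_rate * miss_penalty

    print(cpi(0.03, 27))        # 1-word blocks:                2.97
    print(cpi(0.02, 2 * 27))    # 2-word blocks, no interleave: about 3.3
    print(cpi(0.02, 28))        # 2-word blocks, interleaved:   2.67
    print(cpi(0.01, 4 * 27))    # 4-word blocks, no interleave: about 3.3
    print(cpi(0.01, 30))        # 4-word blocks, interleaved:   2.36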
23. Cache vs. Virtual Memory
- Motivation for virtual memory (physical memory size, multiprogramming)
- The concept behind VM is almost identical to the concept behind caches
  - But different terminology!
  - Cache Block = VM Page
  - Cache Miss = VM Page Fault
- Caches are implemented completely in hardware; VM is implemented in software, with hardware support from the CPU
- Cache speeds up main memory access, while main memory speeds up VM access
- Translation Look-aside Buffer (TLB)
- How to calculate the size of the page table for a given memory system (a worked sketch follows this list)
- How to calculate the page size given the size of the page table
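A worked page-table-size sketch in Python, assuming a 32-bit virtual address space, 4 KB pages, and 4-byte page table entries (assumed values, not from the slides).

    virtual_address_bits = 32
    page_size            = 4 * 1024       # bytes
    pte_size             = 4              # bytes per page table entry

    num_pages = 2**virtual_address_bits // page_size    # 2^20 = 1,048,576 pages
    page_table_bytes = num_pages * pte_size             # 4 MB per process
    print(num_pages, page_table_bytes)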
24. Virtual Memory Definitions
- Key idea: simulate a larger physical memory than is actually available
- General approach
  - Break the address space up into pages
  - Each program accesses a working set of pages
  - Store pages
    - In physical memory as space permits
    - On disk when no space is left in physical memory
  - Access pages using virtual addresses
(Diagram: a memory map translates individual pages of virtual memory to physical memory or to disk.)
25. I/O Systems
26. I/O Systems
27. I/O Concepts
- Disk performance (a worked latency example follows this list)
  - Disk latency = average seek time + average rotational delay + transfer time + controller overhead
- Interrupt-driven I/O
- Memory-mapped I/O
- I/O channels
- DMA (Direct Memory Access)
- I/O communication protocols
  - Daisy chaining
  - Polling
- I/O buses
  - Synchronous vs. asynchronous
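A worked disk-latency sketch in Python, with assumed values: 5 ms average seek, a 7200 RPM disk, a 4 KB transfer at 100 MB/s, and 0.2 ms controller overhead (all illustrative).

    seek_ms       = 5.0
    rotation_ms   = 0.5 * (60_000 / 7200)       # half a revolution, about 4.17 ms
    transfer_ms   = (4 / 1024) / 100 * 1000     # 4 KB at 100 MB/s, about 0.04 ms
    controller_ms = 0.2

    latency_ms = seek_ms + rotation_ms + transfer_ms + controller_ms
    print(latency_ms)    # about 9.4 ms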
28. RAID Systems
- Examined various RAID architectures (RAID 0 to RAID 5): cost, performance (bandwidth, I/O request rate)
  - RAID 0: no redundancy
  - RAID 1: mirroring
  - RAID 2: memory-style ECC
  - RAID 3: bit-interleaved parity
  - RAID 4: block-interleaved parity
  - RAID 5: block-interleaved distributed parity
29. Storage Architectures
- Examined various storage architectures (pros and cons)
  - DAS: Directly-Attached Storage
  - NAS: Network-Attached Storage
  - SAN: Storage Area Network
30. Multiprocessors
31. Motivation
- Application needs
- Amdahl's law
  - T(n) = 1 / (s + p/n)
  - As n → ∞, T(n) → 1/s
- Gustafson's law
  - T'(n) = s + np, so T'(∞) → ∞ !
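A numeric sketch of the two laws in Python, assuming a serial fraction s = 0.05 (so p = 0.95) and n = 100 processors (illustrative values).

    s, p, n = 0.05, 0.95, 100

    amdahl_speedup    = 1 / (s + p / n)   # about 16.8, bounded above by 1/s = 20
    gustafson_speedup = s + n * p         # 95.05, grows without bound with n
    print(amdahl_speedup, gustafson_speedup)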
32. Flynn's Taxonomy of Computing
- SISD (Single Instruction, Single Data)
  - Typical uniprocessor systems that we've studied throughout this course
- SIMD (Single Instruction, Multiple Data)
  - Multiple processors simultaneously executing the same instruction on different data
  - Specialized applications (e.g., image processing)
- MIMD (Multiple Instruction, Multiple Data)
  - Multiple processors autonomously executing different instructions on different data
33. Shared Memory Multiprocessors
(Diagram: multiple processors connected to a shared memory.)
34. MPP (Massively Parallel Processing): Distributed Memory Multiprocessors
(Diagram: each node contains a processor/cache (P/C), local memory (LM), a memory bus (MB), and network interface circuitry (NIC); nodes are connected by a custom-designed network.)
35. Cluster
(Diagram: each node contains a processor/cache (P/C), memory (M), a memory bus (MB), and a bridge to an I/O bus (IOB) with a local disk (LD) and NIC; nodes are connected by a commodity network such as Ethernet, ATM, or Myrinet.)
36. Grid
(Diagram: sites containing P/C, IOC, Hub/LAN, NIC, LD, and SM components, connected through the Internet.)
37. Multiprocessor Concepts
- SIMD applications (image processing)
- MIMD
  - Shared memory
    - Cache coherence problems
    - Bus scalability problems
  - Distributed memory
    - Interconnection networks
    - Clusters of workstations
38. Preparation Strategy
- Read this review to focus your preparation
  - 1 general question
  - 5-6 other questions
  - Around 50% on memory systems
  - Around 50% on I/O and multiprocessors
- Go through the lecture notes
- Go through the training problems
- We will have more office hours for help
- Good luck!