Title: EECS/CS 370
1. EECS/CS 370
- Review for Exam 3
- Nov 14, 2001
2. The Facts
- 50 minute exam
- Friday, Nov 16, 2001, 2:40pm to 3:30pm
- Dow 1013, 1014, 1018, 1005
- EECS 1200, 1003, 1301
- Closed book, closed note
- Calculators are allowed (for arithmetic)
3. Exam Format
- 15 multiple choice questions
- 4 fairly easy questions
- 8 medium difficulty (and more time consuming)
- 3 are difficult (extending material covered in class)
- It will probably be a long exam
- Don't spend all your time on one question; if you are stuck, skip it and move on!
- No figures, as last time
4. Important Topics
- Project 3 (hazards, forwarding, etc.)
- Advanced pipelining topics (concepts, not details)
- Superscalar, superpipelining, out-of-order execution
- Details of Pentium, AMD not important
- Caches, caches, caches (MOST important topic)
- Reference stream, associativity, write policy, performance
- Virtual memory
- Page table, TLB, virtual vs. physically addressed caches
5. Other Stuff to Know
- Pipeline organization for Project 3
- LC2k1 assembly code (format, semantics)
- You should know this by now; if not, something is wrong with you
- Concepts from previous exams that continue to apply are important (CPI, pipelining, hazards, etc.)
6. Notes
- 25 sample questions follow; correct answers are indicated
- Explanations of the answers are in blue
- Cache line vs. cache set
- I use the term line (and thus line index); the book uses the term set (and thus set index). These are the same thing. If you think of a cache as a matrix of data, rows are lines/sets. Within each row are k entries, where k is the associativity. An entry consists of a block of data and the associated tag, valid, and dirty bits. Each line/set has 1 set of bits used for LRU replacement.
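As a rough illustration of this layout, here is a minimal Python sketch (my own naming, not from the slides) of a cache as a matrix of lines, each holding k entries:

    from dataclasses import dataclass

    @dataclass
    class Entry:
        valid: bool = False
        dirty: bool = False
        tag: int = 0
        block: bytes = b""   # the cached block of data

    @dataclass
    class Line:
        entries: list        # k entries, where k = associativity
        lru_bits: int = 0    # one set of replacement bits per line/set

    def make_cache(num_lines: int, associativity: int):
        # Rows are lines/sets; within each row are k entries.
        return [Line(entries=[Entry() for _ in range(associativity)])
                for _ in range(num_lines)]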
7. Adv pipelining - Easy
- Which of the following are FALSE about the term superpipelining?
- It generally refers to processor implementations with more than 5 pipeline stages
- It often enables the clock frequency to be increased
- It often reduces the number of data hazards
- It likely increases the penalty for mispredicted branches
- The Pentium-4 is an example of a superpipelined processor
- It reduces CPI
8. Adv pipelining
- Superscalar implementations generally increase performance by what?
- Reducing CPI
- Increasing clock frequency
- Reducing the number of instructions executed
- Reducing the number of control hazards
- Increasing the number of pipeline stages
9. Adv pipelining
- In an out-of-order execution processor, what's the primary function of the reorder buffer?
- Fetch multiple instructions from memory to support superscalar execution
- Reorder instructions to resolve data hazards
- Check for hazards between instructions to determine what can issue in parallel
- Maintain the original sequential order of instructions through execution and retirement
- Sequentialize instructions in the order they finish execution to prepare for retirement
10. Adv pipelining (harder)
- Suppose you wanted to modify the LC2k1 pipelined processor used in Project 3 to execute 2 instructions at the same time by having 2 parallel pipelines, where pairs of sequential instructions will be executed in the pipelines. How many INTER-PIPELINE data hazard checks will be necessary to make this work properly?
- 6
- 10
- 14
- 18
- 22
Consider a single pipeline. You need to check an instruction's source operands against the destinations of the 3 previous instructions, hence 6 checks. Now consider 2 pipelines. Focus on pipeline 1: you need to check both sources of an instruction against the destinations of the 3 previous instructions in BOTH pipelines. That is 12 total checks, 6 of which are inter-pipeline. The same holds for pipeline 2, so 12 total inter-pipeline checks so far. You also cannot execute 2 instructions in parallel if the first defines, say, r3 and the second uses r3, because you won't be able to forward properly. So 2 more inter-pipeline checks are needed to compare the sources of the instruction in pipeline 2 with the destination of the instruction in pipeline 1. Total: 14.
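The counting argument can be sketched in a few lines of Python (purely illustrative):

    # Comparators for a hypothetical 2-wide LC2k1 pipeline.
    SOURCES = 2        # each instruction has 2 source operands
    PRIOR = 3          # in-flight instructions whose destinations matter

    # Each pipeline checks its instruction's sources against the
    # destinations of the 3 prior instructions in the OTHER pipeline.
    cross_history = 2 * (SOURCES * PRIOR)   # 12

    # The instruction in pipeline 2 must also check its sources against
    # the destination of its same-cycle partner in pipeline 1.
    same_cycle = SOURCES                    # 2

    print(cross_history + same_cycle)       # 14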
11. Adv pipelining (cont'd)
- One note about the previous problem: the opcode-specificness of some of the hazard checks has been ignored, such as a lw's destination being regB whereas an add's is destReg. But for a question like this, think about it from the hardware perspective (i.e., how many comparators do you need?) rather than from a programming perspective, where you think about how many if-statements you need. And by need: comparators cost money, so put in as few as possible. You are allowed to MUX different inputs into those comparators based on opcodes. So, you really only need a maximum of 14 inter-pipeline comparisons each cycle to detect hazards.
12. Project 3 - medium
- For the LC2k1 pipeline simulator you wrote for Project 3, suppose you wanted to test your simulator to make sure the data forwarding from the MEM/WB pipeline register to the EX stage was working properly. Which of the following testcases would expose a bug in this part of your code? Note: you may assume the simulator is operating correctly with the exception of the forwarding path being tested. Further, forwarding to either source operand is considered a sufficient test case. For the testcases, assume initial register contents of r1=1, r2=2, r3=3; all the rest are 0. Further, assume memory location 10 contains 10.
13. Project 3 (cont'd)
(a)          (b)          (c)          (d)
add 2 2 2    add 2 2 3    nand 3 4 5   lw 5 3 11
add 1 2 3    lw 3 4 7     sw 2 4 7     beq 3 3 2
lw 2 4 8     add 2 2 2    add 2 2 2    sw 2 4 2
beq 3 3 1    add 1 2 4    add 1 3 5    add 4 5 6
(e) Multiple of these testcases can be used to expose the bug
- To exercise the forwarding path from the MEM/WB pipeline register, you need a def followed by a use that is 2 instructions later, i.e.,
- def
- some instruction
- use of the def
- (a) is not correct because all def/use pairs are consecutive instructions
- (b) lw followed by an immediate use (sw) causes a noop to be inserted; then the lw will forward its result to the sw from MEM/WB as desired
- (c) is not correct; all def/use pairs are consecutive or 3 instructions apart
- (d) beq is taken, so instruction 2 is not executed. Only consecutive def/use pairs remain with instruction 2 not executed, hence not correct
14. Caches - Easy
- In order to reduce compulsory misses of a direct-mapped, write-through cache with LRU replacement, which technique is in general the most effective?
- Increase cache size
- Increase block size
- Increase cache associativity
- Change the replacement policy from LRU to FIFO
- Make the cache write-back
15. Caches
- Direct-mapped caches are often undesirable because?
- A larger number of parallel tag compares must be performed
- Tag fields are larger, making each tag compare more expensive
- They cannot support write-back designs, thus they have higher memory traffic
- Higher miss rates are suffered because of more conflict misses
- Higher miss rates are suffered because of more capacity misses
16. Caches
- Which of the following statements is FALSE about the principle of temporal locality?
- If a program references location 2367, it's more likely to reference that location again than any other random location
- Any data that is referenced and is not in the cache should be put into the cache
- LRU should be utilized as the replacement policy
- Increasing block size makes use of the principle to a larger extent
- All of these are true statements
17. Caches
- What happens to the tag field when the cache associativity goes from 4 to 2?
- Tag size increases by 1 bit
- Tag size decreases by 1 bit
- Tag size increases by 2 bits
- Tag size decreases by 2 bits
- Tag size is unchanged, but the bits that are used for the tag are changed
By halving the associativity, the number of cache lines is doubled, increasing the line index field by 1 bit. Hence the tag is reduced by 1 bit, as address bits must be conserved.
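A quick sketch of that arithmetic in Python (the parameters are illustrative, not from this question):

    import math

    def tag_bits(addr_bits, cache_bytes, block_bytes, assoc):
        lines = cache_bytes // block_bytes // assoc
        # Index bits grow as associativity shrinks; the tag gets the rest.
        return addr_bits - int(math.log2(lines)) - int(math.log2(block_bytes))

    print(tag_bits(32, 16 * 1024, 64, 4))  # 4-way: 20 tag bits
    print(tag_bits(32, 16 * 1024, 64, 2))  # 2-way: 19, i.e., 1 bit fewer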
18. Caches
- With the following processor configuration, how many lines does the cache have?
- 2^4
- 2^6
- 2^8
- 2^10
- 2^12
32-bit memory addresses, byte addressable, 16KB cache, 4-way associative, 64-byte blocks
Cache = 16KB = 2^14 bytes
Num blocks in cache = size in bytes / num bytes per block = 2^14 / 2^6 = 2^8
Num lines in cache = num blocks / associativity (as each line contains k blocks for a k-way associative cache) = 2^8 / 4 = 2^6, hence B
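The same arithmetic as a Python check:

    cache_bytes = 16 * 1024                   # 2^14
    block_bytes = 64                          # 2^6
    num_blocks = cache_bytes // block_bytes   # 2^8
    num_lines = num_blocks // 4               # 4-way: 2^6 = 64
    print(num_blocks, num_lines)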
19. Caches (harder)
- Consider the following cache configuration and instruction stream. This information applies to the next 7 questions.
Cache configuration:
Size: 32 bytes
Block size: 4 bytes
Write policy: write-back, write-allocate
2-way associative
LRU replacement
8-bit, byte-addressable addresses
Instruction stream:
Ref1: load to addr 10001001
Ref2: load to addr 10001110
Ref3: store to addr 10001101
Ref4: load to addr 10101111
Ref5: store to addr 11001110
Ref6: store to addr 10001010
Ref7: load to addr 11111100
Ref8: store to addr 00001111
20. Instruction Execution Analysis
First calculate the breakdown of the address bits:
Block size = 4 bytes -> 2-bit block offset
Num blocks in cache = 32 bytes / 4 bytes per block = 8
2-way set associative means 2 blocks per line (or set)
Num lines = 8 blocks / 2 blocks per line = 4 -> 2 bits for line index
Remaining bits are tag: 8 - 2 - 2 = 4 bits
Address breakdown: tag (4 bits) | line index (2 bits) | block offset (2 bits)
Instruction execution:
Ref1: miss, insert into line 10, way 0
Ref2: miss, insert into line 11, way 0
Ref3: hit in line 11, way 0, mark as dirty
Ref4: miss, insert into line 11, way 1
Ref5: miss, evict line 11, way 0; block is dirty, write back to mem; insert into line 11, way 0, mark as dirty
Ref6: hit in line 10, way 0, mark as dirty
Ref7: miss, evict line 11, way 1; block is clean, so no write back; insert into line 11, way 1
Ref8: miss, evict line 11, way 0; block is dirty, so write back to mem; insert into line 11, way 0
Cache organization: 4 lines (00, 01, 10, 11) with 2 ways each (way0, way1); each way holds valid (v), dirty (d), and tag bits plus a 4-byte data block; each line also has 1 LRU bit.
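To make the walkthrough concrete, here is a small Python model of this write-back, write-allocate, 2-way LRU cache replaying the reference stream; it reproduces the hit, eviction, and traffic counts used in the next few questions (a sketch, not Project 3 code):

    REFS = [("load", 0b10001001), ("load", 0b10001110), ("store", 0b10001101),
            ("load", 0b10101111), ("store", 0b11001110), ("store", 0b10001010),
            ("load", 0b11111100), ("store", 0b00001111)]

    NUM_LINES, ASSOC, BLOCK_BYTES = 4, 2, 4
    cache = [[None] * ASSOC for _ in range(NUM_LINES)]   # per-way {tag, dirty}
    lru = [[0, 1] for _ in range(NUM_LINES)]             # LRU way listed first
    hits = misses = dirty_evictions = 0

    for op, addr in REFS:
        line = (addr >> 2) & 0b11     # 2-bit line index above 2-bit offset
        tag = addr >> 4               # remaining 4 bits
        ways = cache[line]
        way = next((w for w in range(ASSOC)
                    if ways[w] and ways[w]["tag"] == tag), None)
        if way is not None:
            hits += 1
        else:
            misses += 1
            way = next((w for w in range(ASSOC) if ways[w] is None),
                       lru[line][0])  # fill an empty way, else evict LRU
            if ways[way] and ways[way]["dirty"]:
                dirty_evictions += 1  # dirty victim is written back
            ways[way] = {"tag": tag, "dirty": False}
        if op == "store":
            ways[way]["dirty"] = True
        lru[line].remove(way)         # mark this way most recently used
        lru[line].append(way)

    traffic = (misses + dirty_evictions) * BLOCK_BYTES
    print(hits, dirty_evictions, traffic)   # 2 hits, 2 dirty evictions, 32 bytes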
21. Caches
- How many cache hits are obtained when executing this instruction sequence?
- 1
- 2
- 3
- 4
- 5
22. Caches
- How many dirty blocks are evicted when executing the instruction sequence? Note: don't count any dirty blocks remaining after the instruction sequence is completed.
- 0
- 1
- 2
- 3
- 4
23. Caches
- How many bytes of data are transferred to/from the memory during the execution of the instruction sequence? Again, don't count the eviction of any dirty block that remains after the instruction sequence is completed.
- 20
- 24
- 28
- 32
- 36
Memory traffic = memory transfers to replace evicted blocks + memory transfers to write back dirty blocks
Evicted block transfers = 6 cache misses * 1 block transfer per miss * 4 bytes/block = 24 bytes
Dirty writeback transfers = 2 dirty writebacks * 1 block transfer per writeback * 4 bytes/block = 8 bytes
Total = 24 + 8 = 32 bytes
24. Caches
- What's the tag of the second block that is replaced in the cache during the execution of the instruction sequence?
- 1000
- 100011
- 1010
- 101011
- 1100
Note: I gave the answer (A) during the review. That is wrong; the correct answer is (C).
25. Caches
- How many BITS are required to construct the cache (including data, tags, valid bits, LRU, etc.)?
- 252
- 280
- 292
- 308
- 324
Total = data storage + overhead (i.e., tag, valid, dirty, LRU)
Data storage = 32 bytes * 8 bits/byte = 256 bits
Overhead = 4 lines * 2 ways per line * (4 bits of tag + 1 valid bit + 1 dirty bit) + 4 lines * 1 LRU bit per line = 52
Total = 256 + 52 = 308
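The same count in Python:

    lines, ways, tag_bits, block_bits = 4, 2, 4, 4 * 8
    per_entry = tag_bits + 1 + 1 + block_bits   # tag + valid + dirty + data
    print(lines * ways * per_entry + lines)     # + 1 LRU bit/line -> 308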
26. Caches
- If the cache is made write-through, which of the following is TRUE when executing the instruction sequence?
- Number of hits increases
- Number of hits decreases
- Amount of memory traffic increases
- Amount of memory traffic decreases
- Both hits and memory traffic stay constant
Remember, write-through/write-back does not affect the number of hits/misses. The same data is brought into the cache whether you have write-through or write-back. In general, write-through causes the memory traffic to increase, as each store causes a change in memory. However, in this example there are 4 stores, each of which causes 1 byte of memory to be written, or a total of 4 bytes of dirty data written to memory. In the original write-back strategy, 2 dirty blocks were replaced, causing 2 block updates to memory, or 2 * 4 bytes = 8 bytes of dirty data written back to memory. Memory traffic includes these memory updates plus memory transfers for cache misses. However, the number of cache misses is constant for both strategies, so write-back transfers 4 more bytes of data to memory. Thus (d) is the correct answer.
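A two-line check of the store-traffic difference (miss traffic is identical under both policies):

    write_through_store_bytes = 4 * 1   # 4 stores, 1 byte each, straight to memory
    write_back_store_bytes = 2 * 4      # 2 dirty 4-byte blocks written back
    print(write_through_store_bytes, write_back_store_bytes)   # 4 vs 8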
27. Caches
- If the cache is made no-write-allocate, which of the following is TRUE?
- Number of store hits increases
- Number of store hits decreases
- Number of load hits increases
- Number of load hits decreases
- Number of blocks replaced decreases
You need to run through the entire execution of the instruction sequence again for this question. The same references achieve hits/misses as they did before; this is in general NOT true, it just happened here. The difference is that store misses do not allocate an entry in the cache, so Ref5 and Ref8, which used to cause evictions, no longer do. Only 1 eviction results from this execution, hence (e) is correct.
28. Cache Performance
- Given a 200 MHz processor with 8KB data and 8KB instruction caches, the cost to access the main memory is 20 cycles. Both caches are 2-way associative. A program running on this processor has a 95% I-cache hit rate and a 90% D-cache hit rate. On average, 30% of the instructions which execute are loads or stores. Assume that the base CPI for this program running on a machine with an ideal memory system is 1. What is the CPI of the program on the machine with the specified cache configuration?
- 2.0
- 2.6
- 3.0
- 3.6
- 4.0
29. Cache Performance (cont'd)
Overall CPI = base (useful) CPI + stall-cycle CPI
Stall-cycle CPI = stall CPI of I-cache + stall CPI of D-cache
Assume 100 instructions.
I-cache stalls: 95% hit rate -> 5 misses per 100 instructions; 5 misses * 20 cycles/miss = 100 stall cycles; 100 stall cycles / 100 instructions = 1.0 CPI worth of I-cache stall
D-cache stalls: 90% hit rate, 30% lw/sw -> 3 misses per 100 instructions; 3 misses * 20 cycles/miss = 60 stall cycles; 60 stall cycles / 100 instructions = 0.6 CPI worth of D-cache stall
Stall-cycle CPI = 1.0 + 0.6 = 1.6
Overall CPI = 1 + 1.6 = 2.6
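The same arithmetic in Python:

    miss_penalty = 20
    icache_stall = 0.05 * miss_penalty          # 5 misses per 100 instructions
    dcache_stall = 0.30 * 0.10 * miss_penalty   # 30% mem ops, 10% of them miss
    print(1 + icache_stall + dcache_stall)      # 2.6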
30. Cache Performance
- Given the same 200 MHz processor with 8KB instruction and data caches, with a memory access latency of 20 cycles. Both caches are 2-way associative. A program running on this processor has a 95% I-cache hit rate and a 90% D-cache hit rate. On average, 30% of the instructions are loads or stores. Suppose you have 2 options for the next-generation processor:
- Option 1: Double the clock frequency; this will increase your memory latency to 50 cycles. Assume a base CPI of 1 can still be achieved after this change.
- Option 2: Double the size of your caches; this will increase the instruction cache hit rate to 98% and the data cache hit rate to 95%. Assume the hit latency is still 1 cycle.
31. Cache Performance (cont'd)
- Which of the following is true about these choices?
- Option 1 improves the program execution time by a larger amount despite increasing CPI
- Option 2 improves the program execution time by a larger amount despite increasing CPI
- Option 1 improves the program execution time by a larger amount by decreasing CPI
- Option 2 improves the program execution time by a larger amount by decreasing CPI
- The company building this processor will be bankrupt soon because both of these options are lame designs
32. Cache Performance (cont'd)
Base CPI from the previous question was 2.6
CPI = base CPI + I-cache stall CPI + D-cache stall CPI
Execution time = CPI * num instructions * cycle time
Base cycle time = 1/200MHz = 1/(200 * 10^6) = 5 * 10^-9 s = 5 ns
Option 1 (double clock freq, so cycle time is 2.5 ns): CPI = 1.0 + (5 * 50)/100 + (3 * 50)/100 = 5.0; execution time = 5.0 * I * 2.5 ns = 12.5 * I
Option 2: CPI = 1.0 + (2 * 20)/100 + (1.5 * 20)/100 = 1.7; execution time = 1.7 * I * 5 ns = 8.5 * I
Option 2 is the better choice. Achieved CPI is 1.7 compared with the original value of 2.6, hence (d) is the right answer.
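The comparison as a Python sketch (time is ns per instruction):

    def cpi(i_misses, d_misses, penalty):       # misses per 100 instructions
        return 1.0 + (i_misses + d_misses) * penalty / 100

    opt1 = cpi(5, 3, 50) * 2.5    # doubled clock: 2.5 ns cycle, 50-cycle memory
    opt2 = cpi(2, 1.5, 20) * 5.0  # bigger caches: fewer misses, 5 ns cycle
    print(opt1, opt2)             # 12.5 vs 8.5 -> Option 2 wins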
33. Caches/Virtual Memory
- Which of the following is an architectural parameter (i.e., what the machine language sees) in a processor like the MIPS?
- Cache size
- Page size
- Physical memory address width
- Virtual memory address width
- Page table size
34. Virtual Memory
- Which item is in a TLB entry but NOT in a page table entry?
- Virtual page number
- Physical page number
- Dirty bit
- Access permissions
- Valid bit
35. Virtual Memory
- Which of the following is TRUE about virtually/physically addressed caches?
- In a virtually addressed cache, the TLB lookup must be performed first
- Physically addressed caches can be accessed faster than virtually addressed caches
- Virtually addressed caches handle virtual address aliasing between 2 processes in a more effective manner
- In a physically addressed cache, the bits from the virtual address are used to do the tag match
- All of these are false
36. Virtual Memory
- Given a system with multiple processes and a direct-mapped cache, which cache is likely to have more conflict misses?
- Physically addressed
- Virtually addressed
- Depends on the application
- This question is too hard to deserve an answer
- I hate 370 questions
Note: in the review, (c) was the correct answer, as the question was incomplete. I added the first sentence to make the problem solvable: 2 processes can utilize the same virtual addresses for different data, hence you can get conflicts between 2 processes utilizing the same cache, and thus more conflict misses.
37. Virtual Memory
- Given the following: 32-bit virtual addresses, byte addressable, 16KB pages, maximal physical memory of 128 MB. Ignoring all overhead (valid bits, etc.), how large is your page table? Round your page table entry size up to the nearest byte boundary for calculation.
- 2^18
- 2^19
- 2^20
- 2^21
- 2^22
Length of page table = num virtual pages
Num virtual pages = num virtual bytes / page size = 2^32 / 2^14 = 2^18
Width of page table = log2(num physical pages) + valid bit, etc.
Num physical pages = num physical bytes / page size = 2^27 / 2^14 = 2^13
Width = log2(2^13) = 13 bits
Ignoring overhead and rounding up to the nearest byte, width = 2 bytes
Page table size = length * width = 2^18 * 2 = 2^19 bytes
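Checking the page-table size in Python:

    import math

    va_bits, page_bytes = 32, 16 * 1024
    phys_bytes = 128 * 1024 * 1024
    num_vpages = 2**va_bits // page_bytes                 # 2^18 entries
    ppn_bits = int(math.log2(phys_bytes // page_bytes))   # 13 bits
    entry_bytes = math.ceil(ppn_bits / 8)                 # rounds up to 2
    print(num_vpages * entry_bytes)                       # 2^19 = 524288 bytes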
38. Virtual Memory
- Suppose you are implementing a 2-level hierarchical page table. The architecture has a 32-bit virtual address and is byte addressable. Your design has 4KB pages, 4-byte page table entries, and 1024 entries per second-level page table. How many entries are in your super page table?
- A. 1k
- B. 2k
- C. 4k
- D. 8k
- E. 16k
39. Virtual Memory (cont'd)
Num virtual pages = num virtual bytes / page size = 2^32 / 2^12 = 2^20
Each 2nd-level page table has 1024 (2^10) entries, thus you need 1024 of them so that you have 1 entry per virtual page, i.e., 2^20 / 2^10 = 2^10 = 1024.
Now, there are 1024 second-level page tables, each consisting of 1024 entries. The super page table needs 1 pointer to each second-level page table, hence it also needs 1024 entries, or 1k entries. The size of the super PT and the second-level PTs do NOT have to be the same; it just worked out that way in this example.
FYI, total area of the set of page tables, ignoring overhead bits = area of super page table + area of second-level page tables
Total area = 1024 * 10 bits/entry + 1024 tables * (1024 entries * 4 bytes/entry * 8 bits/byte) = 33,564,672 bits
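And the two-level numbers in Python:

    num_vpages = 2**32 // 2**12               # 4KB pages -> 2^20 virtual pages
    second_level_tables = num_vpages // 1024  # 1024 -> super-table entries
    super_bits = second_level_tables * 10     # 10-bit pointer/entry, per the slide
    second_bits = second_level_tables * 1024 * 4 * 8
    print(second_level_tables, super_bits + second_bits)  # 1024, 33564672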