Title: EECS/CS 370
1. EECS/CS 370
- Review for Exam 3
- Nov 14, 2001
2. The Facts
- 50 minute exam
- Friday, Nov 16, 2001, 2:40pm to 3:30pm
- Dow 1013, 1014, 1018, 1005
- EECS 1200, 1003, 1301
- Closed book, closed note
- Calculators are allowed (for arithmetic)
3. Exam Format
- 15 multiple choice questions
- 4 fairly easy questions
- 8 medium difficulty (and more time consuming)
- 3 are difficult (extending material covered in class)
- It will probably be a long exam
- Don't spend all your time on one question; if you are stuck, skip it and move on!
- No figures, as last time
4. Important Topics
- Project 3 (hazards, forwarding, etc.)
- Advanced pipelining topics (concepts, not details)
- Superscalar, superpipelining, out-of-order execution
- Details of Pentium, AMD not important
- Caches, caches, caches (MOST important topic)
- Reference stream, associativity, write policy, performance
- Virtual memory
- Page table, TLB, virtual vs. physically addressed caches
5. Other Stuff to Know
- Pipeline organization for Project 3
- LC2k1 assembly code (format, semantics)
- You should know this by now; if not, something is wrong with you
- Concepts from previous exams that continue to apply are important (CPI, pipelining, hazards, etc.)
6. Notes
- 25 sample questions follow; correct answers are indicated
- Explanations of the answers are in blue
- Cache line vs. cache set
- I use the term line (and thus line index); the book uses the term set (and thus set index). These are the same thing. If you think of a cache as a matrix of data, rows are lines/sets. Within each row are k entries, where k is the associativity. An entry consists of a block of data and the associated tag, valid, and dirty bits. Each line/set has 1 set of bits used for LRU replacement.
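As a rough illustration of this layout, here is a minimal Python sketch (my own naming, not from the slides) of a cache as a matrix of lines, each holding k entries:

    from dataclasses import dataclass

    @dataclass
    class Entry:
        valid: bool = False
        dirty: bool = False
        tag: int = 0
        block: bytes = b""   # the cached block of data

    @dataclass
    class Line:
        entries: list        # k entries, where k = associativity
        lru_bits: int = 0    # one set of replacement bits per line/set

    def make_cache(num_lines: int, associativity: int):
        # Rows are lines/sets; within each row are k entries.
        return [Line(entries=[Entry() for _ in range(associativity)])
                for _ in range(num_lines)]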
7. Adv pipelining - Easy
- Which of the following are FALSE about the term superpipelining?
- It generally refers to processor implementations with more than 5 pipeline stages
- It often enables the clock frequency to be increased
- It often reduces the number of data hazards
- It likely increases the penalty for mispredicted branches
- The Pentium-4 is an example of a superpipelined processor
- It reduces CPI
8. Adv pipelining
- Superscalar implementations generally increase performance by what?
- Reducing CPI
- Increasing clock frequency
- Reducing the number of instructions executed
- Reducing the number of control hazards
- Increasing the number of pipeline stages
9. Adv pipelining
- In an out-of-order execution processor, what's the primary function of the reorder buffer?
- Fetch multiple instructions from memory to support superscalar execution
- Reorder instructions to resolve data hazards
- Check for hazards between instructions to determine what can issue in parallel
- Maintain the original sequential order of instructions through execution and retirement
- Sequentialize instructions in the order they finish execution to prepare for retirement
10. Adv pipelining (harder)
- Suppose you wanted to modify the LC2k1 pipelined processor used in Project 3 to execute 2 instructions at the same time by having 2 parallel pipelines, where pairs of sequential instructions will be executed in the pipelines. How many INTER-PIPELINE data hazard checks will be necessary to make this work properly?
- 6
- 10
- 14
- 18
- 22
Consider a single pipeline. You need to check an instruction's source operands against the destinations of the 3 previous instructions, hence 6 checks. Now consider 2 pipelines. Focus on pipeline 1: you need to check both sources of an instruction against the destinations of the 3 previous instructions in BOTH pipelines. That is 12 total checks, 6 of which are inter-pipeline. The same holds for pipeline 2, so 12 total inter-pipeline checks so far. You also cannot execute 2 instructions in parallel if the first defines, say, r3 and the second uses r3, because you won't be able to forward properly. So 2 more inter-pipeline checks are needed to compare the sources of the instruction in pipeline 2 with the destination of the instruction in pipeline 1. Total: 14.
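The counting argument can be sketched in a few lines of Python (purely illustrative):

    # Comparators for a hypothetical 2-wide LC2k1 pipeline.
    SOURCES = 2        # each instruction has 2 source operands
    PRIOR = 3          # in-flight instructions whose destinations matter

    # Each pipeline checks its instruction's sources against the
    # destinations of the 3 prior instructions in the OTHER pipeline.
    cross_history = 2 * (SOURCES * PRIOR)   # 12

    # The instruction in pipeline 2 must also check its sources against
    # the destination of its same-cycle partner in pipeline 1.
    same_cycle = SOURCES                    # 2

    print(cross_history + same_cycle)       # 14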
11. Adv pipelining (cont'd)
- One note about the previous problem: the opcode-specificness of some of the hazard checks has been ignored, such as a lw's destination being regB whereas an add's is destReg. But for a question like this, think about it from the hardware perspective (i.e., how many comparators do you need?) rather than from a programming perspective, where you think about how many if-statements you need. And by need: comparators cost money, so put in as few as possible. You are allowed to MUX different inputs into those comparators based on opcodes. So, you really only need a maximum of 14 inter-pipeline comparisons each cycle to detect hazards.
12. Project 3 - medium
- For the LC2k1 pipeline simulator you wrote for Project 3, suppose you wanted to test your simulator to make sure the data forwarding from the MEM/WB pipeline register to the EX stage was working properly. Which of the following testcases would expose a bug in this part of your code? Note: you may assume the simulator is operating correctly with the exception of the forwarding path being tested. Further, forwarding to either source operand is considered a sufficient test case. For the testcases, assume initial register contents of r1=1, r2=2, r3=3; all the rest are 0. Further, assume memory location 10 contains 10.
13. Project 3 (cont'd)
(a)          (b)          (c)          (d)
add 2 2 2    add 2 2 3    nand 3 4 5   lw 5 3 11
add 1 2 3    lw 3 4 7     sw 2 4 7     beq 3 3 2
lw 2 4 8     add 2 2 2    add 2 2 2    sw 2 4 2
beq 3 3 1    add 1 2 4    add 1 3 5    add 4 5 6
(e) Multiple of these testcases can be used to expose the bug
- To exercise the forwarding path from the MEM/WB pipeline register, you need a def followed by a use that is 2 instructions later, i.e.,
- def
- some instruction
- use of the def
- (a) is not correct because all def/use pairs are consecutive instructions
- (b) lw followed by an immediate use (sw) causes a noop to be inserted; then the lw will forward its result to the sw from MEM/WB as desired
- (c) is not correct; all def/use pairs are consecutive or 3 instructions apart
- (d) beq is taken, so instruction 2 is not executed. Only consecutive def/use pairs remain with instruction 2 not executed, hence not correct
14. Caches - Easy
- In order to reduce compulsory misses of a direct-mapped, write-through cache with LRU replacement, which technique is in general the most effective?
- Increase cache size
- Increase block size
- Increase cache associativity
- Change the replacement policy from LRU to FIFO
- Make the cache write-back
15. Caches
- Direct-mapped caches are often undesirable because?
- A larger number of parallel tag compares must be performed
- Tag fields are larger, making each tag compare more expensive
- They cannot support write-back designs, thus they have higher memory traffic
- Higher miss rates are suffered because of more conflict misses
- Higher miss rates are suffered because of more capacity misses
16. Caches
- Which of the following statements is FALSE about the principle of temporal locality?
- If a program references location 2367, it's more likely to reference that location again than any other random location
- Any data that is referenced and is not in the cache should be put into the cache
- LRU should be utilized as the replacement policy
- Increasing block size makes use of the principle to a larger extent
- All of these are true statements
17. Caches
- What happens to the tag field when the cache associativity goes from 4 to 2?
- Tag size increases by 1 bit
- Tag size decreases by 1 bit
- Tag size increases by 2 bits
- Tag size decreases by 2 bits
- Tag size is unchanged, but the bits that are used for the tag are changed
By halving the associativity, the number of cache lines is doubled, increasing the line index field by 1 bit. Hence the tag is reduced by 1 bit, as address bits must be conserved.
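A quick sketch of that arithmetic in Python (the parameters are illustrative, not from this question):

    import math

    def tag_bits(addr_bits, cache_bytes, block_bytes, assoc):
        lines = cache_bytes // block_bytes // assoc
        # Index bits grow as associativity shrinks; the tag gets the rest.
        return addr_bits - int(math.log2(lines)) - int(math.log2(block_bytes))

    print(tag_bits(32, 16 * 1024, 64, 4))  # 4-way: 20 tag bits
    print(tag_bits(32, 16 * 1024, 64, 2))  # 2-way: 19, i.e., 1 bit fewer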
18. Caches
- With the following processor configuration, how many lines does the cache have?
- 2^4
- 2^6
- 2^8
- 2^10
- 2^12
32-bit memory addresses, byte addressable, 16KB cache, 4-way associative, 64-byte blocks
Cache = 16KB = 2^14 bytes
Num blocks in cache = size in bytes / num bytes per block = 2^14 / 2^6 = 2^8
Num lines in cache = num blocks / associativity (as each line contains k blocks for a k-way associative cache) = 2^8 / 4 = 2^6, hence B
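The same arithmetic as a Python check:

    cache_bytes = 16 * 1024                   # 2^14
    block_bytes = 64                          # 2^6
    num_blocks = cache_bytes // block_bytes   # 2^8
    num_lines = num_blocks // 4               # 4-way: 2^6 = 64
    print(num_blocks, num_lines)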
19. Caches (harder)
- Consider the following cache configuration and instruction stream. This information applies to the next 7 questions.
Cache configuration:
Size: 32 bytes
Block size: 4 bytes
Write policy: write-back, write-allocate
2-way associative
LRU replacement
8-bit, byte-addressable addresses
Instruction stream:
Ref1: load to addr 10001001
Ref2: load to addr 10001110
Ref3: store to addr 10001101
Ref4: load to addr 10101111
Ref5: store to addr 11001110
Ref6: store to addr 10001010
Ref7: load to addr 11111100
Ref8: store to addr 00001111
20. Instruction Execution Analysis
First calculate the breakdown of the address bits:
Block size = 4 bytes -> 2-bit block offset
Num blocks in cache = 32 bytes / 4 bytes per block = 8
2-way set associative means 2 blocks per line (or set)
Num lines = 8 blocks / 2 blocks per line = 4 -> 2 bits for line index
Remaining bits are tag: 8 - 2 - 2 = 4 bits
Address breakdown: tag (4 bits) | line index (2 bits) | block offset (2 bits)
Instruction execution:
Ref1: miss, insert into line 10, way 0
Ref2: miss, insert into line 11, way 0
Ref3: hit in line 11, way 0, mark as dirty
Ref4: miss, insert into line 11, way 1
Ref5: miss, evict line 11, way 0; block is dirty, write back to mem; insert into line 11, way 0, mark as dirty
Ref6: hit in line 10, way 0, mark as dirty
Ref7: miss, evict line 11, way 1; block is clean, so no write back; insert into line 11, way 1
Ref8: miss, evict line 11, way 0; block is dirty, so write back to mem; insert into line 11, way 0
Cache organization: 4 lines (00, 01, 10, 11) with 2 ways each (way0, way1); each way holds valid (v), dirty (d), and tag bits plus a 4-byte data block; each line also has 1 LRU bit.
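To make the walkthrough concrete, here is a small Python model of this write-back, write-allocate, 2-way LRU cache replaying the reference stream; it reproduces the hit, eviction, and traffic counts used in the next few questions (a sketch, not Project 3 code):

    REFS = [("load", 0b10001001), ("load", 0b10001110), ("store", 0b10001101),
            ("load", 0b10101111), ("store", 0b11001110), ("store", 0b10001010),
            ("load", 0b11111100), ("store", 0b00001111)]

    NUM_LINES, ASSOC, BLOCK_BYTES = 4, 2, 4
    cache = [[None] * ASSOC for _ in range(NUM_LINES)]   # per-way {tag, dirty}
    lru = [[0, 1] for _ in range(NUM_LINES)]             # LRU way listed first
    hits = misses = dirty_evictions = 0

    for op, addr in REFS:
        line = (addr >> 2) & 0b11     # 2-bit line index above 2-bit offset
        tag = addr >> 4               # remaining 4 bits
        ways = cache[line]
        way = next((w for w in range(ASSOC)
                    if ways[w] and ways[w]["tag"] == tag), None)
        if way is not None:
            hits += 1
        else:
            misses += 1
            way = next((w for w in range(ASSOC) if ways[w] is None),
                       lru[line][0])  # fill an empty way, else evict LRU
            if ways[way] and ways[way]["dirty"]:
                dirty_evictions += 1  # dirty victim is written back
            ways[way] = {"tag": tag, "dirty": False}
        if op == "store":
            ways[way]["dirty"] = True
        lru[line].remove(way)         # mark this way most recently used
        lru[line].append(way)

    traffic = (misses + dirty_evictions) * BLOCK_BYTES
    print(hits, dirty_evictions, traffic)   # 2 hits, 2 dirty evictions, 32 bytes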
21. Caches
- How many cache hits are obtained when executing this instruction sequence?
- 1
- 2
- 3
- 4
- 5
22. Caches
- How many dirty blocks are evicted when executing the instruction sequence? Note: don't count any dirty blocks remaining after the instruction sequence is completed.
- 0
- 1
- 2
- 3
- 4
23. Caches
- How many bytes of data are transferred to/from the memory during the execution of the instruction sequence? Again, don't count the eviction of any dirty block that remains after the instruction sequence is completed.
- 20
- 24
- 28
- 32
- 36
Memory traffic = memory transfers to replace evicted blocks + memory transfers to write back dirty blocks
Evicted block transfers = 6 cache misses * 1 block transfer per miss * 4 bytes/block = 24 bytes
Dirty writeback transfers = 2 dirty writebacks * 1 block transfer per writeback * 4 bytes/block = 8 bytes
Total = 24 + 8 = 32 bytes
24. Caches
- What's the tag of the second block that is replaced in the cache during the execution of the instruction sequence?
- 1000
- 100011
- 1010
- 101011
- 1100
Note: I gave the answer (A) during the review. That is wrong; the correct answer is (C).
25. Caches
- How many BITS are required to construct the cache (including data, tags, valid bits, LRU, etc.)?
- 252
- 280
- 292
- 308
- 324
Total = data storage + overhead (i.e., tag, valid, dirty, LRU)
Data storage = 32 bytes * 8 bits/byte = 256 bits
Overhead = 4 lines * 2 ways per line * (4 bits of tag + 1 valid bit + 1 dirty bit) + 4 lines * 1 LRU bit per line = 52
Total = 256 + 52 = 308
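The same count in Python:

    lines, ways, tag_bits, block_bits = 4, 2, 4, 4 * 8
    per_entry = tag_bits + 1 + 1 + block_bits   # tag + valid + dirty + data
    print(lines * ways * per_entry + lines)     # + 1 LRU bit/line -> 308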
26. Caches
- If the cache is made write-through, which of the following is TRUE when executing the instruction sequence?
- Number of hits increases
- Number of hits decreases
- Amount of memory traffic increases
- Amount of memory traffic decreases
- Both hits and memory traffic stay constant
Remember, write-through/write-back does not affect the number of hits/misses. The same data is brought into the cache whether you have write-through or write-back. In general, write-through causes the memory traffic to increase, as each store causes a change in memory. However, in this example there are 4 stores, each of which causes 1 byte of memory to be written, or a total of 4 bytes of dirty data written to memory. In the original write-back strategy, 2 dirty blocks were replaced, causing 2 block updates to memory, or 2 * 4 bytes = 8 bytes of dirty data written back to memory. Memory traffic includes these memory updates plus memory transfers for cache misses. However, the number of cache misses is constant for both strategies, so write-back transfers 4 more bytes of data to memory. Thus (d) is the correct answer.
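A two-line check of the store-traffic difference (miss traffic is identical under both policies):

    write_through_store_bytes = 4 * 1   # 4 stores, 1 byte each, straight to memory
    write_back_store_bytes = 2 * 4      # 2 dirty 4-byte blocks written back
    print(write_through_store_bytes, write_back_store_bytes)   # 4 vs 8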
27. Caches
- If the cache is made no-write-allocate, which of the following is TRUE?
- Number of store hits increases
- Number of store hits decreases
- Number of load hits increases
- Number of load hits decreases
- Number of blocks replaced decreases
You need to run through the entire execution of the instruction sequence again for this question. The same references achieve hits/misses as they did before; this is in general NOT true, it just happened here. The difference is that store misses do not allocate an entry in the cache, so Ref5 and Ref8, which used to cause evictions, no longer do. Only 1 eviction results from this execution, hence (e) is correct.
28. Cache Performance
- Given a 200 MHz processor with 8KB data and 8KB instruction caches, the cost to access the main memory is 20 cycles. Both caches are 2-way associative. A program running on this processor has a 95% I-cache hit rate and a 90% D-cache hit rate. On average, 30% of the instructions which execute are loads or stores. Assume that the base CPI for this program running on a machine with an ideal memory system is 1. What is the CPI of the program on the machine with the specified cache configuration?
- 2.0
- 2.6
- 3.0
- 3.6
- 4.0
29. Cache Performance (cont'd)
Overall CPI = base (useful) CPI + stall-cycle CPI
Stall-cycle CPI = stall CPI of I-cache + stall CPI of D-cache
Assume 100 instructions.
I-cache stalls: 95% hit rate -> 5 misses per 100 instructions; 5 misses * 20 cycles/miss = 100 stall cycles; 100 stall cycles / 100 instructions = 1.0 CPI worth of I-cache stall
D-cache stalls: 90% hit rate, 30% lw/sw -> 3 misses per 100 instructions; 3 misses * 20 cycles/miss = 60 stall cycles; 60 stall cycles / 100 instructions = 0.6 CPI worth of D-cache stall
Stall-cycle CPI = 1.0 + 0.6 = 1.6
Overall CPI = 1 + 1.6 = 2.6
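The same arithmetic in Python:

    miss_penalty = 20
    icache_stall = 0.05 * miss_penalty          # 5 misses per 100 instructions
    dcache_stall = 0.30 * 0.10 * miss_penalty   # 30% mem ops, 10% of them miss
    print(1 + icache_stall + dcache_stall)      # 2.6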
30. Cache Performance
- Given the same 200 MHz processor with 8KB instruction and data caches, with a memory access latency of 20 cycles. Both caches are 2-way associative. A program running on this processor has a 95% I-cache hit rate and a 90% D-cache hit rate. On average, 30% of the instructions are loads or stores. Suppose you have 2 options for the next-generation processor:
- Option 1: Double the clock frequency; this will increase your memory latency to 50 cycles. Assume a base CPI of 1 can still be achieved after this change.
- Option 2: Double the size of your caches; this will increase the instruction cache hit rate to 98% and the data cache hit rate to 95%. Assume the hit latency is still 1 cycle.
31. Cache Performance (cont'd)
- Which of the following is true about these choices?
- Option 1 improves the program execution time by a larger amount despite increasing CPI
- Option 2 improves the program execution time by a larger amount despite increasing CPI
- Option 1 improves the program execution time by a larger amount by decreasing CPI
- Option 2 improves the program execution time by a larger amount by decreasing CPI
- The company building this processor will be bankrupt soon because both of these options are lame designs
32. Cache Performance (cont'd)
Base CPI from the previous question was 2.6
CPI = base CPI + I-cache stall CPI + D-cache stall CPI
Execution time = CPI * num instructions * cycle time
Base cycle time = 1/200MHz = 1/(200 * 10^6) = 5 * 10^-9 s = 5 ns
Option 1 (double clock freq, so cycle time is 2.5 ns): CPI = 1.0 + (5 * 50)/100 + (3 * 50)/100 = 5.0; execution time = 5.0 * I * 2.5 ns = 12.5 * I
Option 2: CPI = 1.0 + (2 * 20)/100 + (1.5 * 20)/100 = 1.7; execution time = 1.7 * I * 5 ns = 8.5 * I
Option 2 is the better choice. Achieved CPI is 1.7 compared with the original value of 2.6, hence (d) is the right answer.
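The comparison as a Python sketch (time is ns per instruction):

    def cpi(i_misses, d_misses, penalty):       # misses per 100 instructions
        return 1.0 + (i_misses + d_misses) * penalty / 100

    opt1 = cpi(5, 3, 50) * 2.5    # doubled clock: 2.5 ns cycle, 50-cycle memory
    opt2 = cpi(2, 1.5, 20) * 5.0  # bigger caches: fewer misses, 5 ns cycle
    print(opt1, opt2)             # 12.5 vs 8.5 -> Option 2 wins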
33. Caches/Virtual Memory
- Which of the following is an architectural parameter (i.e., what the machine language sees) in a processor like the MIPS?
- Cache size
- Page size
- Physical memory address width
- Virtual memory address width
- Page table size
34. Virtual Memory
- Which item is in a TLB entry but NOT in a page table entry?
- Virtual page number
- Physical page number
- Dirty bit
- Access permissions
- Valid bit
35. Virtual Memory
- Which of the following is TRUE about virtually/physically addressed caches?
- In a virtually addressed cache, the TLB lookup must be performed first
- Physically addressed caches can be accessed faster than virtually addressed caches
- Virtually addressed caches handle virtual address aliasing between 2 processes in a more effective manner
- In a physically addressed cache, the bits from the virtual address are used to do the tag match
- All of these are false
36. Virtual Memory
- Given a system with multiple processes and a direct-mapped cache, which cache is likely to have more conflict misses?
- Physically addressed
- Virtually addressed
- Depends on the application
- This question is too hard to deserve an answer
- I hate 370 questions
Note: in the review, (c) was the correct answer, as the question was incomplete. I added the first sentence to make the problem solvable: 2 processes can utilize the same virtual addresses for different data, hence you can get conflicts between 2 processes utilizing the same cache, and thus more conflict misses.
37. Virtual Memory
- Given the following: 32-bit virtual addresses, byte addressable, 16KB pages, maximal physical memory of 128 MB. Ignoring all overhead (valid bits, etc.), how large is your page table? Round your page table entry size up to the nearest byte boundary for calculation.
- 2^18
- 2^19
- 2^20
- 2^21
- 2^22
Length of page table = num virtual pages
Num virtual pages = num virtual bytes / page size = 2^32 / 2^14 = 2^18
Width of page table = log2(num physical pages) + valid bit, etc.
Num physical pages = num physical bytes / page size = 2^27 / 2^14 = 2^13
Width = log2(2^13) = 13 bits
Ignoring overhead and rounding up to the nearest byte, width = 2 bytes
Page table size = length * width = 2^18 * 2 = 2^19 bytes
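Checking the page-table size in Python:

    import math

    va_bits, page_bytes = 32, 16 * 1024
    phys_bytes = 128 * 1024 * 1024
    num_vpages = 2**va_bits // page_bytes                 # 2^18 entries
    ppn_bits = int(math.log2(phys_bytes // page_bytes))   # 13 bits
    entry_bytes = math.ceil(ppn_bits / 8)                 # rounds up to 2
    print(num_vpages * entry_bytes)                       # 2^19 = 524288 bytes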
38. Virtual Memory
- Suppose you are implementing a 2-level hierarchical page table. The architecture has a 32-bit virtual address and is byte addressable. Your design has 4KB pages, 4-byte page table entries, and 1024 entries per second-level page table. How many entries are in your super page table?
- A. 1k
- B. 2k
- C. 4k
- D. 8k
- E. 16k
39. Virtual Memory (cont'd)
Num virtual pages = num virtual bytes / page size = 2^32 / 2^12 = 2^20
Each 2nd-level page table has 1024 (2^10) entries, thus you need 1024 of them so that you have 1 entry per virtual page, i.e., 2^20 / 2^10 = 2^10 = 1024.
Now, there are 1024 second-level page tables, each consisting of 1024 entries. The super page table needs 1 pointer to each second-level page table, hence it also needs 1024 entries, or 1k entries. The size of the super PT and the second-level PTs do NOT have to be the same; it just worked out that way in this example.
FYI, total area of the set of page tables, ignoring overhead bits = area of super page table + area of second-level page tables
Total area = 1024 * 10 bits/entry + 1024 tables * (1024 entries * 4 bytes/entry * 8 bits/byte) = 33,564,672 bits
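And the two-level numbers in Python:

    num_vpages = 2**32 // 2**12               # 4KB pages -> 2^20 virtual pages
    second_level_tables = num_vpages // 1024  # 1024 -> super-table entries
    super_bits = second_level_tables * 10     # 10-bit pointer/entry, per the slide
    second_bits = second_level_tables * 1024 * 4 * 8
    print(second_level_tables, super_bits + second_bits)  # 1024, 33564672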