Transcript and Presenter's Notes

Title: EECS/CS 370


1
EECS/CS 370
  • Review for Exam 3
  • Nov 14, 2001

2
The Facts
  • 50 minute exam
  • Friday, Nov 16, 2001, 2:40pm to 3:30pm
  • Dow 1013, 1014, 1018, 1005
  • EECS 1200, 1003, 1301
  • Closed book, closed notes
  • Calculators are allowed (for arithmetic)

3
Exam Format
  • 15 multiple choice questions
  • 4 fairly easy questions
  • 8 medium difficulty (and more time consuming)
  • 3 are difficult (extending material covered in
    class)
  • Length: it will probably be a long exam
  • Don't spend all your time on 1 question; if you
    are stuck, skip it and move on!
  • No figures, as last time

4
Important Topics
  • Project 3 (hazards, forwarding, etc.)
  • Advanced pipelining topics (concepts not details)
  • Superscalar, superpipelining, out-of-order
    execution
  • Details of Pentium, AMD not important
  • Caches, caches, caches (MOST important topic)
  • Reference stream, associativity, write policy,
    performance
  • Virtual memory
  • Page table, TLB, virtual vs physically addressed
    caches

5
Other Stuff to Know
  • Pipeline organization for Project 3
  • LC2k1 assembly code (format, semantics)
  • You should know this by now; if not, something
    is wrong with you :)
  • Concepts from previous exams that continue to
    apply are important (CPI, pipelining, hazards,
    etc.)

6
Notes
  • 25 sample questions follow; correct answers are
    indicated by a marker
  • Explanations of the answers are in blue
  • Cache line vs. cache set
  • I use the term line (and thus line index); the
    book uses the term set (and thus set index).
    These are the same thing. If you think of a
    cache as a matrix of data, rows are lines/sets.
    Within each row are k entries, where k is the
    associativity. An entry consists of a block of
    data and the associated tag, valid, and dirty
    bits. Each line/set has 1 set of bits used for
    the LRU replacement.

7
Adv pipelining - Easy
  • Which of the following are FALSE about the term
    superpipelining?
  • It generally refers to processor implementations
    with more than 5 pipeline stages
  • It often enables the clock frequency to be
    increased
  • It often reduces the number of data hazards
  • It likely increases the penalty for mispredicted
    branches
  • The Pentium-4 is an example of a superpipelined
    processor
  • It reduces CPI

8
Adv pipelining
  • Superscalar implementations generally increase
    performance by what?
  • Reducing CPI
  • Increasing clock frequency
  • Reducing the number of instructions executed
  • Reducing the number of control hazards
  • Increasing the number of pipeline stages

9
Adv pipelining
  • In an out-of-order execution processor, what's
    the primary function of the reorder buffer?
  • Fetch multiple instructions from memory to
    support superscalar execution
  • Reorder instructions to resolve data hazards
  • Check for hazards between instructions to
    determine what can issue in parallel
  • Maintain the original sequential order of
    instructions through execution and retirement
  • Sequentialize instructions in the order they
    finish execution to prepare for retirement

10
Adv pipelining (harder)
  • Suppose you wanted to modify the LC2k1 pipelined
    processor used in Project 3 to execute 2
    instructions at the same time by having 2
    parallel pipelines where pairs of sequential
    instructions will be executed in the pipelines.
    How many INTER-PIPELINE data hazard checks will
    be necessary to make this work properly?
  • 6
  • 10
  • 14
  • 18
  • 22

Consider a single pipeline. You need to check an
instruction's source operands against the
destinations of the 3 previous instructions,
hence 6 checks. Now consider 2 pipelines.
Focusing on pipeline 1, you need to check both
sources of an instruction against the
destinations of the 3 previous instructions in
BOTH pipelines. That is 12 total checks, 6 of
which are inter-pipeline. The same holds for
pipeline 2, so 12 total inter-pipeline checks so
far. You also cannot execute 2 instructions in
parallel if the first defines, say, r3 and the
second uses r3, because you won't be able to
forward properly. So 2 more inter-pipeline
checks are needed to compare the sources of the
instruction in pipeline 2 against the destination
of the instruction in pipeline 1. Total = 14
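To make the counting concrete, here is a minimal
Python sketch of the same arithmetic (the function
name and parameters are mine, not from the course):

    def inter_pipeline_checks(num_srcs=2, in_flight=3):
        # Each pipeline's current instruction compares its sources against
        # the destinations of the 3 older in-flight instructions in the
        # OTHER pipeline.
        cross = 2 * num_srcs * in_flight   # 2 pipelines x 2 sources x 3 older = 12
        # Plus: the sources of the pipeline-2 instruction against the
        # destination of the instruction issued alongside it in pipeline 1.
        same_cycle = num_srcs              # 2
        return cross + same_cycle

    print(inter_pipeline_checks())         # 14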
11
Adv pipelining (cont'd)
  • One note about the previous problem: the opcode
    specificity of some of the hazard checks has
    been ignored, such as a lw's destination being
    regB whereas an add's is destReg. But for a
    question like this, think about it from the
    hardware perspective, i.e., how many comparators
    do you need?, rather than from a programming
    perspective where you think about how many
    if-statements you need. And by "need":
    comparators cost money, so put in as few as
    possible. You are allowed to MUX different
    inputs into those comparators based on opcodes.
    So, you really only need to do a maximum of 14
    inter-pipeline comparisons each cycle to detect
    hazards.

12
Project 3 medium
  • For the LC2k1 pipeline simulator you wrote for
    Project 3, suppose you wanted to test your
    simulator to make sure the data forwarding from
    the MEM/WB pipeline register to the EX stage was
    working properly. Which of the following
    testcases would expose a bug in this part of
    your code? Note, you may assume the simulator
    is operating correctly with the exception of the
    forwarding path being tested. Further,
    forwarding to either source operand is
    considered a sufficient test. For the
    testcases, assume initial register contents of
    r1=1, r2=2, r3=3; all the rest are 0. Further
    assume the contents of memory location 10 is 10.

13
Project 3 (cont'd)
(a) add 2 2 2     (b) add 1 2 3     (c) lw 2 4 8      (d) beq 3 3 1
    add 2 2 3         lw 3 4 7          add 2 2 2         add 1 2 4
    nand 3 4 5        sw 2 4 7          add 2 2 2         add 1 3 5
    lw 5 3 11         beq 3 3 2         sw 2 4 2          add 4 5 6
(e) Multiple of these testcases can be used to
    expose the bug
  • To exercise the forwarding path from the MEM/WB
    pipeline register, you need a def followed by a
    use that is 2 instructions later, i.e.,
  • def
  • some instruction
  • use of the def
  • (a) is not correct because all def/use pairs are
    consecutive instructions
  • (b) the lw followed by an immediate use (sw)
    causes a noop to be inserted; then the lw will
    forward its result to the sw from MEM/WB as
    desired, so (b) is the answer
  • (c) is not correct: all def/use pairs are
    consecutive or 3 instructions apart
  • (d) the beq is taken, so instruction 2 is not
    executed. Only consecutive def/use pairs remain
    with instruction 2 not executed, hence not
    correct

14
Caches - Easy
  • In order to reduce compulsory misses of a
    direct-mapped, write-through cache with LRU
    replacement, which technique in general is the
    most effective?
  • Increase cache size
  • Increase block size
  • Increase cache associativity
  • Change the replacement policy from LRU to FIFO
  • Make the cache write-back

15
Caches
  • Direct-mapped caches are often undesirable
    because of which of the following?
  • A larger number of parallel tag compares must be
    performed
  • Tag fields are larger, making each tag compare
    more expensive
  • Cannot support write-back designs, thus they have
    higher memory traffic
  • Higher miss rates are suffered because of more
    conflict misses
  • Higher miss rates are suffered because of more
    capacity misses

16
Caches
  • Which of the following statements is FALSE about
    the principle of temporal locality?
  • If a program references location 2367, it's more
    likely to reference that location again than any
    other random location
  • Any data that is referenced and is not in the
    cache should be put into the cache
  • LRU should be utilized as the replacement policy
  • Increasing block size makes use of the principle
    to a larger extent
  • All of these are true statements

17
Caches
  • What happens to the tag field when the cache
    associativity goes from 4 to 2?
  • Tag size increases by 1 bit
  • Tag size decreases by 1 bit
  • Tag size increases by 2 bits
  • Tag size decreases by 2 bits
  • Tag size is unchanged, but the bits that are used
    for the tag are changed

By halving the associativity, the number of
cache lines is doubled, increasing the line index
field by 1 bit. Hence the tag is reduced by 1
bit, as address bits must be conserved.
18
Caches
  • With the following processor configuration, how
    many lines does the cache have?
  • 2^4
  • 2^6
  • 2^8
  • 2^10
  • 2^12

32-bit memory addresses, byte addressable, 16KB
cache, 4-way associative, 64-byte blocks
Cache: 16KB = 2^14 bytes
Num blocks in cache = size in bytes / num bytes
per block = 2^14 / 2^6 = 2^8
Num lines in cache = num blocks / associativity
(as each line contains k blocks for a k-way
associative cache) = 2^8 / 4 = 2^6, hence (b)
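A small Python sketch of this address-bit
arithmetic (the helper name is mine); it also
reproduces the previous question's effect of
halving the associativity:

    from math import log2

    def cache_geometry(cache_bytes, block_bytes, assoc, addr_bits):
        blocks = cache_bytes // block_bytes
        lines = blocks // assoc                  # lines (sets)
        offset_bits = int(log2(block_bytes))
        index_bits = int(log2(lines))
        tag_bits = addr_bits - index_bits - offset_bits
        return lines, index_bits, offset_bits, tag_bits

    # This question: 16KB, 4-way, 64-byte blocks, 32-bit addresses
    print(cache_geometry(16 * 1024, 64, 4, 32))  # (64, 6, 6, 20): 2^6 lines

    # Previous question: halving associativity doubles the lines, so the
    # index grows by 1 bit and the tag shrinks by 1 bit
    print(cache_geometry(16 * 1024, 64, 2, 32))  # (128, 7, 6, 19)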
19
Caches (harder)
  • Consider the following cache configuration and
    instruction stream. This information applies to
    the next 7 questions.

Cache configuration
Size: 32 bytes
Block size: 4 bytes
Write policy: write-back, write-allocate
2-way associative, LRU replacement
8-bit, byte-addressable addresses

Instruction stream
Ref1: load to addr 10001001
Ref2: load to addr 10001110
Ref3: store to addr 10001101
Ref4: load to addr 10101111
Ref5: store to addr 11001110
Ref6: store to addr 10001010
Ref7: load to addr 11111100
Ref8: store to addr 00001111
20
Instruction Execution Analysis
First calculate the breakdown of the address bits:
Block size 4 bytes: 2-bit block offset
Num blocks in cache = 32 bytes / 4 bytes per
block = 8
2-way set associative means 2 blocks per line (or
set)
Number of lines = 8 blocks / 2 blocks per line =
4: 2 bits for line index
Remaining bits are tag: 8 - 2 - 2 = 4 bits

Address breakdown: tag (4 bits), line index (2
bits), block offset (2 bits)

Instruction execution:
Ref1: miss, insert into line 10, way 0
Ref2: miss, insert into line 11, way 0
Ref3: hit in line 11, way 0, mark as dirty
Ref4: miss, insert into line 11, way 1
Ref5: miss, evict line 11, way 0 (block is dirty,
write back to mem); insert into line 11, way 0,
mark as dirty
Ref6: hit in line 10, way 0, mark as dirty
Ref7: miss, evict line 11, way 1 (block is clean,
so no write back); insert into line 11, way 1
Ref8: miss, evict line 11, way 0 (block is dirty,
so write back to mem); insert into line 11, way 0

(Figure: cache organization. 4 lines, 00 through
11, each with 2 ways; each entry holds valid,
dirty, and tag bits plus a 4-byte data block;
each line also has 1 LRU bit.)
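The following minimal Python sketch (my code, not
the course's reference simulator) replays this
reference stream through a 2-way, LRU, write-back,
write-allocate cache; its outputs answer the next
several questions:

    BLOCK, LINES, WAYS = 4, 4, 2   # 4-byte blocks, 4 lines, 2-way associative

    refs = [("load", 0b10001001), ("load", 0b10001110), ("store", 0b10001101),
            ("load", 0b10101111), ("store", 0b11001110), ("store", 0b10001010),
            ("load", 0b11111100), ("store", 0b00001111)]

    cache = [[] for _ in range(LINES)]   # per line: entries ordered LRU-first
    hits = dirty_evictions = traffic = 0
    victim_tags = []

    for op, addr in refs:
        line = (addr // BLOCK) % LINES
        tag = addr // (BLOCK * LINES)
        entries = cache[line]
        entry = next((e for e in entries if e["tag"] == tag), None)
        if entry:                        # hit: moved to MRU position below
            hits += 1
            entries.remove(entry)
        else:                            # miss: fetch block, evict LRU if full
            traffic += BLOCK
            if len(entries) == WAYS:
                victim = entries.pop(0)
                victim_tags.append(victim["tag"])
                if victim["dirty"]:
                    dirty_evictions += 1
                    traffic += BLOCK     # write the dirty block back to memory
            entry = {"tag": tag, "dirty": False}
        if op == "store":
            entry["dirty"] = True        # write-back: just mark dirty
        entries.append(entry)            # MRU position

    print(hits, dirty_evictions, traffic)   # 2 2 32
    print(format(victim_tags[1], "04b"))    # 1010: tag of 2nd replaced block

It prints 2 hits, 2 dirty evictions, 32 bytes of
memory traffic, and 1010 as the tag of the second
replaced block, matching the worked answers below.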
21
Caches
  • How many cache hits are obtained when executing
    this instruction sequence?
  • 1
  • 2
  • 3
  • 4
  • 5

22
Caches
  • How many dirty blocks are evicted when executing
    the instruction sequence? Note: don't count any
    dirty blocks remaining after the instruction
    sequence is completed.
  • 0
  • 1
  • 2
  • 3
  • 4

23
Caches
  • How many bytes of data are transferred to/from
    the memory during the execution of the
    instruction sequence? Again, don't count the
    eviction of any dirty block that remains after
    the instruction sequence is completed.
  • 20
  • 24
  • 28
  • 32
  • 36

Memory traffic = memory transfers to replace
evicted blocks + memory transfers to write back
dirty blocks
Evicted block transfers: 6 cache misses × 1 block
transfer per miss × 4 bytes/block = 24 bytes
Dirty writeback transfers: 2 dirty writebacks × 1
block transfer per writeback × 4 bytes/block = 8
bytes
Total = 24 + 8 = 32 bytes
24
Caches
  • What's the tag of the second block that is
    replaced in the cache during the execution of
    the instruction sequence?
  • 1000
  • 100011
  • 1010
  • 101011
  • 1100

Note: I gave the answer (A) during the review.
That is wrong; the correct answer is (C).
25
Caches
  • How many BITS are required to construct the cache
    (including data, tags, valid bits, LRU, etc)?
  • 252
  • 280
  • 292
  • 308
  • 324

Total = data storage + overhead (i.e., tag,
valid, dirty, LRU)
Data storage = 32 bytes × 8 bits/byte = 256 bits
Overhead = 4 lines × 2 ways per line × (4 bits of
tag + 1 valid bit + 1 dirty bit) + 4 lines × 1
LRU bit per line = 48 + 4 = 52 bits
Total = 256 + 52 = 308 bits
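The same arithmetic in a few lines of Python
(variable names are mine):

    data_bits = 32 * 8                # 32 bytes of data
    tag_bits, valid, dirty = 4, 1, 1  # per entry, from the address breakdown
    lines, ways = 4, 2
    overhead = lines * ways * (tag_bits + valid + dirty) + lines * 1  # 1 LRU bit/line
    print(data_bits + overhead)       # 308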
26
Caches
  • If the cache is made write-through, which of the
    following is True when executing the instruction
    sequence?
  • Number of hits increases
  • Number of hits decreases
  • Amount of memory traffic increases
  • Amount of memory traffic decreases
  • Both hits and memory traffic stay constant

Remember, write-through/write-back does not
affect the number of hits/misses. The same data
is brought into the cache whether you have
write-through or write-back. In general,
write-through causes the memory traffic to
increase, as each store causes a change in
memory. However, in this example there are 4
stores, each causing 1 byte of memory to be
written, for a total of 4 bytes of dirty data
written to memory. In the original write-back
strategy, 2 dirty blocks were replaced, causing 2
block updates to memory, or 2 × 4 bytes = 8 bytes
of dirty data written back to memory. Memory
traffic includes these memory updates plus memory
transfers for cache misses. However, the number
of cache misses is constant for both strategies,
so write-back transfers 4 more bytes of data to
memory. Thus (d) is the correct answer.
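A quick numeric check (assuming 1-byte stores, as
the slide does):

    miss_bytes = 6 * 4                  # 6 misses x 4-byte blocks, same either way
    write_through = miss_bytes + 4 * 1  # 4 stores x 1 byte written through = 28
    write_back = miss_bytes + 2 * 4     # 2 dirty blocks written back = 32
    print(write_through, write_back)    # 28 32: write-through traffic decreases here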
27
Caches
  • If the cache is made no-write-allocate, which of
    the following is True?
  • Number of store hits increases
  • Number of store hits decreases
  • Number of load hits increases
  • Number of load hits decreases
  • Number of blocks replaced decreases

You need to run through the entire execution of
the instruction sequence again for this question.
The same references achieve hits/misses as they
did before; this is NOT true in general, it just
happened here. The difference is that store
misses do not allocate an entry in the cache, so
Ref5 and Ref8, which used to cause evictions, no
longer do. Only 1 eviction results from this
execution, hence (e) is correct.
28
Cache Performance
  • Given a 200 MHz processor with 8KB data and 8KB
    instruction caches, the cost to access the main
    memory is 20 cycles. Both caches are 2-way
    associative. A program running on this
    processor has a 95% Icache hit rate and a 90%
    Dcache hit rate. On average, 30% of the
    instructions which execute are loads or stores.
    Assume that the base CPI for this program
    running on a machine with an ideal memory
    system is 1. What is the CPI of the program on
    the machine with the specified cache
    configuration?
  • A. 2.0
  • B. 2.6
  • C. 3.0
  • D. 3.6
  • E. 4.0
29
Cache Performance (cont'd)
Overall CPI = base (useful) CPI + stall-cycle CPI
Stall-cycle CPI = stall CPI of icache + stall CPI
of dcache
Assume 100 instructions.
Icache stalls: 95% hit rate = 5 misses per 100
instructions; 5 misses × 20 cycles/miss = 100
stall cycles; 100 stall cycles / 100 instructions
= 1.0 CPI worth of icache stall
Dcache stalls: 90% hit rate, 30% lw/sw = 3 misses
per 100 instructions; 3 misses × 20 cycles/miss =
60 stall cycles; 60 stall cycles / 100
instructions = 0.6 CPI worth of dcache stall
Stall-cycle CPI = 1.0 + 0.6 = 1.6
Overall CPI = 1 + 1.6 = 2.6
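The same bookkeeping as a Python sketch (function
and parameter names are mine):

    def overall_cpi(base, penalty, i_hit, d_hit, mem_frac):
        i_stall = (1 - i_hit) * penalty             # every instruction is fetched
        d_stall = mem_frac * (1 - d_hit) * penalty  # only loads/stores use the dcache
        return base + i_stall + d_stall

    print(overall_cpi(1.0, 20, 0.95, 0.90, 0.30))   # ~2.6 (up to float rounding)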
30
Cache Performance
  • Given the same 200 MHz processor with 8KB
    instruction and data caches, with a memory
    access latency of 20 cycles. Both caches are
    2-way associative. A program running on this
    processor has a 95% icache hit rate and a 90%
    dcache hit rate. On average, 30% of the
    instructions are loads or stores. Suppose you
    have 2 options for the next generation processor
  • Option 1: Double the clock frequency; this will
    increase your memory latency to 50 cycles.
    Assume a base CPI of 1 can still be achieved
    after this change.
  • Option 2: Double the size of your caches; this
    will increase the instruction cache hit rate to
    98% and the data cache hit rate to 95%. Assume
    the hit latency is still 1 cycle.

31
Cache Performance (cont'd)
  • Which of the following is true about these
    choices?
  • Option 1 improves the program execution time by a
    larger amount despite increasing CPI
  • Option 2 improves the program execution time by a
    larger amount despite increasing CPI
  • Option 1 improves the program execution time by a
    larger amount by decreasing CPI
  • Option 2 improves the program execution time by a
    larger amount by decreasing CPI
  • The company building this processor will be
    bankrupt soon because both of these options are
    lame designs

32
Cache Performance (cont'd)
The CPI from the previous question was 2.6.
CPI = base CPI + icache stall CPI + dcache stall
CPI
Execution time = CPI × num instructions × cycle
time
Base cycle time = 1/200 MHz = 1/(200 × 10^6) =
5 × 10^-9 s = 5 ns
Option 1 (double clock freq, so cycle time is 2.5
ns): CPI = 1.0 + (5 × 50)/100 + (3 × 50)/100 =
5.0; execution time = 5.0 × I × 2.5 ns = 12.5I ns
Option 2: CPI = 1.0 + (2 × 20)/100 + (1.5 ×
20)/100 = 1.7; execution time = 1.7 × I × 5 ns =
8.5I ns
Option 2 is the better choice. The achieved CPI
is 1.7 compared with the original value of 2.6,
hence (d) is the right answer.
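Extending the CPI sketch above to execution time
per instruction (the instruction count I cancels
out, so per-instruction time is enough to compare):

    def time_per_inst_ns(base, penalty, i_hit, d_hit, mem_frac, cycle_ns):
        cpi = base + (1 - i_hit) * penalty + mem_frac * (1 - d_hit) * penalty
        return cpi * cycle_ns

    print(time_per_inst_ns(1.0, 50, 0.95, 0.90, 0.30, 2.5))  # Option 1: ~12.5 ns
    print(time_per_inst_ns(1.0, 20, 0.98, 0.95, 0.30, 5.0))  # Option 2: ~8.5 ns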
33
Caches/Virtual memory
  • Which of the following is an architectural
    parameter (i.e., what the machine language
    sees) in a processor like the MIPS?
  • Cache size
  • Page size
  • Physical memory address width
  • Virtual memory address width
  • Page table size

34
Virtual Memory
  • Which item is in a TLB entry but NOT in a page
    table entry?
  • Virtual page number
  • Physical page number
  • Dirty bit
  • Access permissions
  • Valid bit

35
Virtual Memory
  • Which of the following is TRUE about
    virtually/physically addressed caches?
  • In a virtually addressed cache, the TLB lookup
    must be performed first
  • Physically addressed caches can be accessed
    faster than virtually addressed caches
  • Virtually addressed caches handle virtual
    address aliasing between 2 processes in a more
    effective manner
  • In a physically addressed cache, the bits from
    the virtual address are used to do the tag match
  • All of these are false

36
Virtual Memory
  • Given a system with multiple processes and a
    direct-mapped cache. Which cache is likely to
    have more conflict misses?
  • Physically addressed
  • Virtually addressed
  • Depends on the application
  • This question is too hard to deserve an answer
  • I hate 370 questions

Note: in the review, (c) was the correct answer,
as the question was incomplete. I added the
first sentence, making the problem solvable: 2
processes can utilize the same virtual addresses
for different data, hence you can get conflicts
between 2 processes utilizing the same cache, and
thus more conflict misses.
37
Virtual Memory
  • Given 32-bit virtual addresses, byte
    addressable, 16KB pages, and a maximum physical
    memory of 128 MB. Ignoring all overhead (valid
    bits, etc.), how large is your page table?
    Round your page table entry size up to the
    nearest byte boundary for the calculation.
  • 2^18
  • 2^19
  • 2^20
  • 2^21
  • 2^22

Length of page table = num virtual pages
Num virtual pages = num virtual bytes / page size
= 2^32 / 2^14 = 2^18
Width of page table = log2(num physical pages) +
valid bit, etc.
Num physical pages = num physical bytes / page
size = 2^27 / 2^14 = 2^13
Width = log2(2^13) = 13 bits
Ignoring overhead and rounding up to the nearest
byte, width = 2 bytes
Page table size = length × width = 2^18 × 2 =
2^19 bytes
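As a Python sketch (the helper name is mine):

    def page_table_bytes(va_bits, pa_bits, page_bytes):
        entries = 2 ** va_bits // page_bytes        # one entry per virtual page
        offset_bits = page_bytes.bit_length() - 1   # 14 for 16KB pages
        ppn_bits = pa_bits - offset_bits            # 27 - 14 = 13
        entry_bytes = -(-ppn_bits // 8)             # round up to whole bytes: 2
        return entries * entry_bytes

    print(page_table_bytes(32, 27, 16 * 1024))      # 524288 = 2^19 bytes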
38
Virtual Memory
  • Suppose you are implementing a 2-level
    hierarchical page table. The architecture has a
    32-bit virtual address and is byte addressable.
    Your design has 4KB pages, 4-byte page table
    entries, and 1024 entries per second-level page
    table. How many entries are in your super page
    table?
  • A. 1k
  • B. 2k
  • C. 4k
  • D. 8k
  • E. 16k

39
Virtual Memory (cont'd)
Num virtual pages = num virtual bytes / page size
= 2^32 / 2^12 = 2^20
Each 2nd-level page table has 1024 (2^10)
entries, thus you need 1024 of them so that you
have 1 entry per virtual page, i.e., 2^20 / 2^10
= 2^10 = 1024. Now, there are 1024 second-level
page tables, each consisting of 1024 entries.
The super page table needs 1 pointer to each
second-level page table, hence it also needs 1024
entries, or 1k entries. The size of the super PT
and the second-level PTs do NOT have to be the
same; it just worked out that way in this
example. FYI, the total area of the set of page
tables, ignoring overhead bits, = area of super
page table + area of second-level page tables.
Total area = 1024 entries × 10 bits/entry + 1024
tables × (1024 entries × 4 bytes/entry × 8
bits/byte) = 33,564,672 bits.
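A sketch of the same counting (variable names are
mine; the 10 bits per super-table entry follow the
slide's area figure, which ignores overhead):

    va_bits, page_bytes, entry_bytes, l2_entries = 32, 4096, 4, 1024

    virtual_pages = 2 ** va_bits // page_bytes   # 2^20 virtual pages
    l2_tables = virtual_pages // l2_entries      # 1024 second-level tables
    super_entries = l2_tables                    # one pointer per 2nd-level table
    print(super_entries)                         # 1024, i.e., 1k: answer A

    # Area check: 10 bits per super entry (enough to select one of 1024
    # second-level tables), ignoring overhead bits.
    total_bits = super_entries * 10 + l2_tables * (l2_entries * entry_bytes * 8)
    print(total_bits)                            # 33564672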