Title: Exam Review
1 Exam Review
2 Homework
- HW1
- Solutions to be posted
- HW2
- Solutions to be posted
- HW3
- Solutions to be posted
3 Exam Format
- Closed book
- Can bring one sheet of notes (can write on both sides)
- Calculators are allowed
- Computers are not allowed
- Time: 7:00 PM to 9:30 PM, Wednesday 3/7/07
- Questions (Subject to change)
- True or False (Need explanation)
- Short answer questions
- Exercise questions (resembling homework problems)
- Total 5 to 6 questions (multiple parts)
4 Midterm Review
- Material Covered
- Chapters 1-4 and Appendix A of textbook
- Associated homework problems
- Associated lecture notes
5 Fundamentals of Computer Design
- Important material from Chapter 1 includes
- Trends in computer usage
- Trends in technology
- Computer costs and price
- Computer performance and execution time
- Amdahl's law
- Benchmarks
- Computer performance equation
- CPU time = IC × CPI × clock cycle time (see the sketch after this list)
- Poor performance metrics (MFLOPS, MIPS)
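As a quick refresher, a minimal Python sketch of these two formulas with made-up numbers (the instruction count, CPI, clock period, enhanced fraction, and enhancement speedup below are all hypothetical):

    # CPU performance equation: CPU time = IC x CPI x clock cycle time
    ic = 2_000_000        # instruction count (hypothetical)
    cpi = 1.5             # average clock cycles per instruction (hypothetical)
    cycle_time = 1e-9     # 1 ns clock cycle, i.e., a 1 GHz clock (hypothetical)
    cpu_time = ic * cpi * cycle_time
    print(f"CPU time = {cpu_time * 1e3:.1f} ms")              # 3.0 ms

    # Amdahl's law: overall speedup = 1 / ((1 - f) + f / s)
    f = 0.4               # fraction of execution time that is enhanced (hypothetical)
    s = 10                # speedup of the enhanced fraction (hypothetical)
    print(f"Overall speedup = {1 / ((1 - f) + f / s):.2f}")   # ~1.56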
6 Instruction Set Design
- Important material from Chapter 2 includes
- ISA classifications (accumulator, stack, etc.)
- Memory addressing
- Operations
- Operand specifiers
- Instruction set formats
- Compilers and ISA
- DLX architecture
7 Pipelining
- Important material from Appendix A includes
- Basic issues in pipelining (stages, throughput, latency)
- Multiple-cycle and pipelined DLX
- Pipeline performance
- Pipeline hazards (structural, data, control)
- Coping with hazards (stalls, bypassing, branch prediction)
- Floating point pipelines
8 Instruction Level Parallelism
- Important material from Chapter 3 and Chapter 4 includes
- Dynamic techniques
- Scoreboard
- Tomasulo's approach
- Static techniques
- Compiler scheduling
- Loop unrolling
- Superscalar, VLIW approaches
- Branch prediction algorithms
9 Problem A.2 Review (solution on pages B-35 to B-38)
- Loop: L.D F0, 0(R2)
- L.D F4, 0(R3)
- MUL.D F0, F0, F4
- ADD.D F2, F0, F2
- DADDUI R2, R2, 8
- DADDUI R3, R3, 8
- DSUBU R5, R4, R2
- BNEZ R5, Loop
- (What this loop computes is sketched at the end of this problem setup)
- Initial value of R4 = R2 + 792
- Standard 5-stage integer pipeline plus the MIPS FP pipeline
- Structural hazards due to WB contention; earlier instructions get priority
- Need pipeline diagrams for the following two cases:
- Without any forwarding (register file forwarding within the same cycle is OK); branches are handled by flushing the pipeline
- With forwarding; assume branches are handled by predicting them as not taken
- Latencies in cycles: integer ALU 0, data memory 1, FP add 3, FP multiply 6, FP divide 24
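As a reading aid, the loop above accumulates a running sum of element-wise products into F2 (a dot product). A minimal Python sketch of that computation, with purely illustrative array values:

    def dot_product(x, y):
        # Mirrors the loop body: F2 = F2 + X[i] * Y[i] for each element pair
        f2 = 0.0
        for xi, yi in zip(x, y):
            f2 += xi * yi
        return f2

    print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0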
10 Solution
- Note: latencies are applicable when data is forwarded to the EX stage
- The code iterates 99 times (792 / 8 = 99)
11 No Forwarding; Branches Flush the Pipeline
- F, D, E, M, and W denote the fetch, decode, execute, memory access, and write-back stages
- Stalls:
- Cycle 22: the register file can read and write in the same cycle, so the dependence resolves in cycle 22
- Cycle 20: uncertainty about where to fetch is handled by flushing the pipeline; the correct fetch occurs in cycle 23
- No structural hazard stalls due to write-back contention
- Two instructions simultaneously in execution use different FUs, so there is no structural hazard
- Total loop execution: 99 iterations × 22 cycles/iteration = 2178 cycles
12 Forwarding; Branches Predicted Not Taken
- F, D, E, M, and W denote the fetch, decode, execute, memory access, and write-back stages
- Cycle 16: DSUBU is stalled at ID to avoid contention with ADD.D for the WB stage
- Cycle 17: branches are predicted not taken, and the fetch is from the fall-through location
- But for the last loop iteration, this branch is mispredicted; the fetch is redone at cycle 19
- Two instructions simultaneously in execution use different FUs, so there is no structural hazard
- The first iteration ends in cycle 19; the second begins in cycle 19
- All iterations except the last take 18 cycles; the last takes 19 cycles (16 cycles if code follows it)
- Loop execution time: 98 × 18 + 19 = 1783 cycles
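A minimal Python sketch double-checking the two totals above (the per-iteration cycle counts are read off the pipeline diagrams):

    iterations = 792 // 8                 # 99 loop iterations

    # No forwarding, branches handled by flushing the pipeline
    print(iterations * 22)                # 22 cycles/iteration -> 2178 cycles

    # Forwarding, branches predicted not taken
    print((iterations - 1) * 18 + 19)     # last iteration mispredicts -> 1783 cycles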
13 Problem
- For this problem, we will work with the following SAXPY loop.
- Do i = 1, n
- Z[i] = A * X[i] + Y[i]  (a short Python sketch of this computation appears after the problem statement)
- r1 indicates the starting address of the X array, r2 indicates the start of the Y array, and r3 indicates the start of the Z array; MAX indicates the final address of the X array that needs to be included in the loop
- 0  lf f0, 0(r1)       // f at the end of an opcode indicates a double / floating-point operation
- 1  multf f4, f0, f2   // floating-point multiply; f2 contains the scalar A
- 2  lf f0, 0(r2)
- 3  addf f6, f4, f0    // floating-point add
- 4  sf f6, 0(r3)
- 5  addi r1, r1, 8     // integer add
- 6  addi r2, r2, 8
- 7  addi r3, r3, 8
- 8  sgti r4, r1, MAX   // r4 = 1 if (r1 > MAX), else r4 = 0; integer/ALU instruction
- 9  beqz r4, loop
- You will be asked to compute different schedules for this loop, for several different machine configurations. For the purposes of this question, assume that branches are always predicted as taken, multf latency is 4 cycles, addf latency is 2 cycles, and all other operations execute in a single cycle. For parts a and b, assume no structural hazards (unless stated otherwise, assume all functional units exist in multiple copies).
- a. Consider a conventional DLX pipeline with full forwarding and no provisions for precise state (i.e., out-of-order writebacks are allowed). Draw the pipeline diagram. How many cycles does it take to execute a single iteration of this loop (i.e., from the completion of the first instruction in one iteration to the completion of the first instruction in the next iteration)? Remember to respect all RAW hazards, and label all stalls. Watch out for pipeline violations (i.e., two instructions in the same pipeline stage at the same time).
- b. Reschedule the code to separate dependent instructions from one another. Draw the pipeline diagram for the rescheduled loop. How many cycles do two unrolled iterations take?
- c. Go back to the original (pre-unrolling) loop. Consider a scoreboard with 6 entries. The machine has 5 functional units: 3 integer ALUs that perform the ALU operations and the loads and stores, and 2 FP units, which perform the addf and multf functions. Fill in the instruction status table for the scoreboard (IS, RO, EX, and WB for each entry). However, rather than an X in each slot, write the cycle number in which the given instruction completes the particular function. Consider the cycle in which the instruction sgti is dispatched. Highlight the functions completed at that cycle. Fill in the state of the register status table and the functional unit table as they would be at the end of this cycle. How many cycles does a loop iteration take to execute? (Again, count from the end of the first instruction of one iteration to the end of the first instruction of the next iteration.) Remember there is no bypassing.
- d. Assume, in addition to the description in part c, an additional constraint: only one instruction can read its operands or write them back in any cycle, as there is only one bus that can be used by the functional units to access the register file. Assume the first instruction waiting for operands can read (the availability of register information is communicated via dedicated, individual control circuits, so there is no contention there). Also assume that the bus is wide enough that the one instruction can read as many registers as it wants in a given cycle.
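For reference, a minimal Python sketch of the SAXPY computation the loop above implements (the array values are purely illustrative):

    def saxpy(a, x, y):
        # Z[i] = A * X[i] + Y[i], matching the loop body: multf then addf
        return [a * xi + yi for xi, yi in zip(x, y)]

    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]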
14 Problem
- DLX pipeline with full forwarding
- Branches predicted-as-taken
- Multf latency 4 cycles
- Addf latency 2 cycles
- Other operations: 1 cycle (execute in a single cycle)
- Parts a and b: multiple FUs (no structural hazards)
15 Part a
17 cycles are needed to execute a single
iteration (all but the last iteration) and 18
cycles are needed for the last iteration. (The
last iteration takes 19 cycles if there is no
code after the instructions of the loop).
16 Part b
With stalls:
lf f0, 0(r1)
stall
multf f4, f0, f2
lf f0, 0(r2)
stall
stall
stall
addf f6, f4, f0
stall            // another stall comes due to WB structural hazard
sf f6, 0(r3)
addi r1, r1, 8
addi r2, r2, 8
addi r3, r3, 8
sgti r4, r1, MAX
stall
beqz r4, loop

Original code (without stalls shown):
lf f0, 0(r1)
multf f4, f0, f2
lf f0, 0(r2)
addf f6, f4, f0
sf f6, 0(r3)
addi r1, r1, 8
addi r2, r2, 8
addi r3, r3, 8
sgti r4, r1, MAX
beqz r4, loop
17 Part b: Rescheduled 2-Iteration Loop
Unrolled two iterations with renaming, before rescheduling (with stalls):
lf f0, 0(r1)
stall
multf f4, f0, f2
lf t1, 0(r2)
stall
stall
stall
addf f6, f4, t1
stall
sf f6, 0(r3)
addi r1, r1, 8
addi r2, r2, 8
addi r3, r3, 8
sgti r4, r1, MAX
stall
beqz r4, loop
lf t2, 8(r1)
stall
multf t5, t2, f2
lf t3, 8(r2)
stall
stall
stall
addf t4, t5, t3
stall
sf t4, 8(r3)
addi r1, r1, 16
addi r2, r2, 16
addi r3, r3, 16
sgti r4, r1, MAX
stall
beqz r4, loop

After unrolling, renaming, and rescheduling:
lf f0, 0(r1)
lf t1, 0(r2)
multf f4, f0, f2
lf t2, 8(r1)
lf t3, 8(r2)
multf t5, t2, f2
addi r1, r1, 16
addf f6, f4, t1
addi r2, r2, 16
sf f6, 0(r3)
addf t4, t5, t3
sgti r4, r1, MAX
sf t4, 8(r3)
beqz r4, loop
addi r3, r3, 16
18 Unrolled and Scheduled
Even though we did scheduling to avoid all data hazards, the structural hazards (limited write and memory ports) are causing delays. 19 / 2 = 9.5 cycles are needed for a single iteration (2 iterations = 15 instructions + 4 stalls = 19 cycles). Note: it is possible to reschedule the loop further to reduce or eliminate the stalls due to structural hazards.
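A quick Python check of the arithmetic above (the instruction and stall counts are taken from the unrolled, scheduled pipeline diagram):

    instructions = 15                      # two rescheduled, unrolled iterations
    stalls = 4                             # remaining stalls from structural hazards
    print((instructions + stalls) / 2)     # 9.5 cycles per original iteration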
19 Part C Scoreboard - Notes
20 Part C Scoreboard
A loop iteration takes 20 - 4 = 16 cycles
21 Part D Scoreboard with a Single Bus to Access the Register File
A loop iteration takes 24 - 4 = 20 cycles
22 Example Short Questions
- What is forwarding or bypassing?
- Why do you need benchmark programs?
- What are the main differences between RISC and CISC?
- What is Amdahl's law?
- CPI vs. IPC: what is the relation?
- Will pipelining help instruction latency?
- Why is Tomasulo's algorithm better than a scoreboard?
- What are the advantages of compiler-based scheduling?
- How do you measure the performance of a computer?
23 True or False Questions (with Explanations)
- Deeper pipeline architectures yield higher throughput.
- Floating-point-based graphics routines are better benchmarks than many other programs.
- Branch prediction algorithms are evaluated based on their performance measured against benchmark programs.
- Dynamic scheduling causes more structural hazards.
- The relative performance of two processors with the same instruction set architecture (ISA) can be judged by clock rate or by the performance of a single benchmark suite.
- An architecture with flaws cannot be successful.
- A stack architecture is better than an accumulator architecture in conserving memory bus traffic.
- A computer with a higher clock speed is likely to provide better performance.