Title: Exam Review
1 Exam Review
2 Homework
- HW1
- Solutions to be posted
- HW2
- Solutions to be posted
- HW3
- Solutions to be posted
3 Exam Format
- Closed book
- Can bring one sheet of notes (can write on both sides)
- Calculators are allowed
- Computers are not allowed
- Time: 7:00 PM to 9:30 PM, Wednesday 3/7/07
- Questions (Subject to change)
- True or False (Need explanation)
- Short answer questions
- Exercise questions (resembling homework problems)
- Total 5 to 6 questions (multiple parts)
4 Midterm Review
- Material Covered
- Chapters 1-4 and Appendix A of textbook
- Associated homework problems
- Associated lecture notes
5 Fundamentals of Computer Design
- Important material from Chapter 1 includes
- Trends in computer usage
- Trends in technology
- Computer costs and price
- Computer performance and execution time
- Amdahl's law
- Benchmarks
- Computer performance equation
- CPU time = IC × CPI × clock cycle time (see the sketch after this list)
- Poor performance metrics (MFLOPS, MIPS)
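As a quick refresher, a minimal Python sketch of these two formulas with made-up numbers (the instruction count, CPI, clock period, enhanced fraction, and enhancement speedup below are all hypothetical):

    # CPU performance equation: CPU time = IC x CPI x clock cycle time
    ic = 2_000_000        # instruction count (hypothetical)
    cpi = 1.5             # average clock cycles per instruction (hypothetical)
    cycle_time = 1e-9     # 1 ns clock cycle, i.e., a 1 GHz clock (hypothetical)
    cpu_time = ic * cpi * cycle_time
    print(f"CPU time = {cpu_time * 1e3:.1f} ms")              # 3.0 ms

    # Amdahl's law: overall speedup = 1 / ((1 - f) + f / s)
    f = 0.4               # fraction of execution time that is enhanced (hypothetical)
    s = 10                # speedup of the enhanced fraction (hypothetical)
    print(f"Overall speedup = {1 / ((1 - f) + f / s):.2f}")   # ~1.56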
6 Instruction Set Design
- Important material from Chapter 2 includes
- ISA classifications (accumulator, stack, etc.)
- Memory addressing
- Operations
- Operand specifiers
- Instruction set formats
- Compilers and ISA
- DLX architecture
7 Pipelining
- Important material from Appendix A includes
- Basic issues in pipelining (stages, throughput, latency)
- Multiple-cycle and pipelined DLX
- Pipeline performance
- Pipeline hazards (structural, data, control)
- Coping with hazards (stalls, bypassing, branch prediction)
- Floating point pipelines
8 Instruction Level Parallelism
- Important material from Chapter 3 and Chapter 4 includes
- Dynamic techniques
- Scoreboard
- Tomasulo's approach
- Static techniques
- Compiler scheduling
- Loop unrolling
- Superscalar, VLIW approaches
- Branch prediction algorithms
9 Problem A.2 Review (solution on pages B-35 to B-38)
- Loop: L.D F0, 0(R2)
- L.D F4, 0(R3)
- MUL.D F0, F0, F4
- ADD.D F2, F0, F2
- DADDUI R2, R2, 8
- DADDUI R3, R3, 8
- DSUBU R5, R4, R2
- BNEZ R5, Loop
- (What this loop computes is sketched at the end of this problem setup)
- Initial value of R4 = R2 + 792
- Standard 5-stage integer pipeline plus the MIPS FP pipeline
- Structural hazards due to WB contention; earlier instructions get priority
- Need pipeline diagrams for the following two cases:
- Without any forwarding (register file forwarding within the same cycle is OK); branches are handled by flushing the pipeline
- With forwarding; assume branches are handled by predicting them as not taken
- Latencies in cycles: integer ALU 0, data memory 1, FP add 3, FP multiply 6, FP divide 24
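As a reading aid, the loop above accumulates a running sum of element-wise products into F2 (a dot product). A minimal Python sketch of that computation, with purely illustrative array values:

    def dot_product(x, y):
        # Mirrors the loop body: F2 = F2 + X[i] * Y[i] for each element pair
        f2 = 0.0
        for xi, yi in zip(x, y):
            f2 += xi * yi
        return f2

    print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0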
10 Solution
- Note: latencies are applicable when data is forwarded to the EX stage
- The code iterates 99 times (792 / 8 = 99)
11 No Forwarding; Branches Flush the Pipeline
- F, D, E, M, and W denote the fetch, decode, execute, memory access, and write-back stages
- Stalls:
- Cycle 22: the register file can read and write in the same cycle, so the dependence resolves in cycle 22
- Cycle 20: uncertainty about where to fetch is handled by flushing the pipeline; the correct fetch occurs in cycle 23
- No structural hazard stalls due to write-back contention
- Two instructions simultaneously in execution use different FUs, so there is no structural hazard
- Total loop execution: 99 iterations × 22 cycles/iteration = 2178 cycles
12 Forwarding; Branches Predicted Not Taken
- F, D, E, M, and W denote the fetch, decode, execute, memory access, and write-back stages
- Cycle 16: DSUBU is stalled at ID to avoid contention with ADD.D for the WB stage
- Cycle 17: branches are predicted not taken, and the fetch is from the fall-through location
- But for the last loop iteration, this branch is mispredicted; the fetch is redone at cycle 19
- Two instructions simultaneously in execution use different FUs, so there is no structural hazard
- The first iteration ends in cycle 19; the second begins in cycle 19
- All iterations except the last take 18 cycles; the last takes 19 cycles (16 cycles if code follows it)
- Loop execution time: 98 × 18 + 19 = 1783 cycles
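A minimal Python sketch double-checking the two totals above (the per-iteration cycle counts are read off the pipeline diagrams):

    iterations = 792 // 8                 # 99 loop iterations

    # No forwarding, branches handled by flushing the pipeline
    print(iterations * 22)                # 22 cycles/iteration -> 2178 cycles

    # Forwarding, branches predicted not taken
    print((iterations - 1) * 18 + 19)     # last iteration mispredicts -> 1783 cycles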
13 Problem
- For this problem, we will work with the following SAXPY loop.
- Do i = 1, n
- Z[i] = A * X[i] + Y[i]  (a short Python sketch of this computation appears after the problem statement)
- r1 indicates the starting address of the X array, r2 indicates the start of the Y array, and r3 indicates the start of the Z array; MAX indicates the final address of the X array that needs to be included in the loop
- 0  lf f0, 0(r1)       // f at the end of an opcode indicates a double / floating-point operation
- 1  multf f4, f0, f2   // floating-point multiply; f2 contains the scalar A
- 2  lf f0, 0(r2)
- 3  addf f6, f4, f0    // floating-point add
- 4  sf f6, 0(r3)
- 5  addi r1, r1, 8     // integer add
- 6  addi r2, r2, 8
- 7  addi r3, r3, 8
- 8  sgti r4, r1, MAX   // r4 = 1 if (r1 > MAX), else r4 = 0; integer/ALU instruction
- 9  beqz r4, loop
- You will be asked to compute different schedules for this loop, for several different machine configurations. For the purposes of this question, assume that branches are always predicted as taken, multf latency is 4 cycles, addf latency is 2 cycles, and all other operations execute in a single cycle. For parts a and b, assume no structural hazards (unless stated otherwise, assume all functional units exist in multiple copies).
- a. Consider a conventional DLX pipeline with full forwarding and no provisions for precise state (i.e., out-of-order writebacks are allowed). Draw the pipeline diagram. How many cycles does it take to execute a single iteration of this loop (i.e., from the completion of the first instruction in one iteration to the completion of the first instruction in the next iteration)? Remember to respect all RAW hazards, and label all stalls. Watch out for pipeline violations (i.e., two instructions in the same pipeline stage at the same time).
- b. Reschedule the code to separate dependent instructions from one another. Draw the pipeline diagram for the rescheduled loop. How many cycles do two unrolled iterations take?
- c. Go back to the original (pre-unrolling) loop. Consider a scoreboard with 6 entries. The machine has 5 functional units: 3 integer ALUs that perform the ALU operations and the loads and stores, and 2 FP units, which perform the addf and multf functions. Fill in the instruction status table for the scoreboard (IS, RO, EX, and WB for each entry). However, rather than an X in each slot, write the cycle number in which the given instruction completes the particular function. Consider the cycle in which the instruction sgti is dispatched. Highlight the functions completed at that cycle. Fill in the state of the register status table and the functional unit table as they would be at the end of this cycle. How many cycles does a loop iteration take to execute? (Again, count from the end of the first instruction of one iteration to the end of the first instruction of the next iteration.) Remember there is no bypassing.
- d. Assume, in addition to the description in part c, an additional constraint: only one instruction can read its operands or write them back in any cycle, as there is only one bus that can be used by the functional units to access the register file. Assume the first instruction waiting for operands can read (the availability of register information is communicated via dedicated, individual control circuits, so there is no contention there). Also assume that the bus is wide enough that the one instruction can read as many registers as it wants in a given cycle.
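For reference, a minimal Python sketch of the SAXPY computation the loop above implements (the array values are purely illustrative):

    def saxpy(a, x, y):
        # Z[i] = A * X[i] + Y[i], matching the loop body: multf then addf
        return [a * xi + yi for xi, yi in zip(x, y)]

    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]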
14 Problem
- DLX pipeline with full forwarding
- Branches predicted-as-taken
- Multf latency 4 cycles
- Addf latency 2 cycles
- Other operations: 1 cycle (execute in a single cycle)
- Parts a and b: multiple FUs (no structural hazards)
15 Part a
17 cycles are needed to execute a single
iteration (all but the last iteration) and 18
cycles are needed for the last iteration. (The
last iteration takes 19 cycles if there is no
code after the instructions of the loop).
16 Part b
With stalls:
lf f0, 0(r1)
stall
multf f4, f0, f2
lf f0, 0(r2)
stall
stall
stall
addf f6, f4, f0
stall            // another stall comes due to WB structural hazard
sf f6, 0(r3)
addi r1, r1, 8
addi r2, r2, 8
addi r3, r3, 8
sgti r4, r1, MAX
stall
beqz r4, loop

Original code (without stalls shown):
lf f0, 0(r1)
multf f4, f0, f2
lf f0, 0(r2)
addf f6, f4, f0
sf f6, 0(r3)
addi r1, r1, 8
addi r2, r2, 8
addi r3, r3, 8
sgti r4, r1, MAX
beqz r4, loop
17 Part b: Rescheduled 2-Iteration Loop
Unrolled two iterations with renaming, before rescheduling (with stalls):
lf f0, 0(r1)
stall
multf f4, f0, f2
lf t1, 0(r2)
stall
stall
stall
addf f6, f4, t1
stall
sf f6, 0(r3)
addi r1, r1, 8
addi r2, r2, 8
addi r3, r3, 8
sgti r4, r1, MAX
stall
beqz r4, loop
lf t2, 8(r1)
stall
multf t5, t2, f2
lf t3, 8(r2)
stall
stall
stall
addf t4, t5, t3
stall
sf t4, 8(r3)
addi r1, r1, 16
addi r2, r2, 16
addi r3, r3, 16
sgti r4, r1, MAX
stall
beqz r4, loop

After unrolling, renaming, and rescheduling:
lf f0, 0(r1)
lf t1, 0(r2)
multf f4, f0, f2
lf t2, 8(r1)
lf t3, 8(r2)
multf t5, t2, f2
addi r1, r1, 16
addf f6, f4, t1
addi r2, r2, 16
sf f6, 0(r3)
addf t4, t5, t3
sgti r4, r1, MAX
sf t4, 8(r3)
beqz r4, loop
addi r3, r3, 16
18 Unrolled and Scheduled
Even though we did scheduling to avoid all data hazards, the structural hazards (limited write and memory ports) are causing delays. 19 / 2 = 9.5 cycles are needed for a single iteration (2 iterations = 15 instructions + 4 stalls = 19 cycles). Note: it is possible to reschedule the loop further to reduce or eliminate the stalls due to structural hazards.
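A quick Python check of the arithmetic above (the instruction and stall counts are taken from the unrolled, scheduled pipeline diagram):

    instructions = 15                      # two rescheduled, unrolled iterations
    stalls = 4                             # remaining stalls from structural hazards
    print((instructions + stalls) / 2)     # 9.5 cycles per original iteration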
19 Part C Scoreboard - Notes
20 Part C Scoreboard
A loop iteration takes 20 - 4 = 16 cycles
21 Part D Scoreboard with a Single Bus to Access the Register File
A loop iteration takes 24 - 4 = 20 cycles
22 Example Short Questions
- What is forwarding or bypassing?
- Why do you need benchmark programs?
- What are the main differences between RISC and CISC?
- What is Amdahl's law?
- CPI vs. IPC: what is the relation?
- Will pipelining help instruction latency?
- Why is Tomasulo's algorithm better than a scoreboard?
- What are the advantages of compiler-based scheduling?
- How do you measure the performance of a computer?
23 True or False Questions (with Explanations)
- Deeper pipeline architectures yield higher throughput.
- Floating-point-based graphics routines are better benchmarks than many other programs.
- Branch prediction algorithms are evaluated based on their performance measured against benchmark programs.
- Dynamic scheduling causes more structural hazards.
- The relative performance of two processors with the same instruction set architecture (ISA) can be judged by clock rate or by the performance of a single benchmark suite.
- An architecture with flaws cannot be successful.
- A stack architecture is better than an accumulator architecture in conserving memory bus traffic.
- A computer with a higher clock speed is likely to provide better performance.