Exam Review - PowerPoint PPT Presentation


Exam Review



Slides: 24
Provided by: venkat3



Exam Review
  • HW1
  • Solutions to be posted
  • HW2
  • Solutions to be posted
  • HW3
  • Solutions to be posted

Exam Format
  • Closed book
  • Can bring one sheet of notes (can write on both sides)
  • Calculators are allowed
  • Computers are not allowed
  • Time: 7:00 PM to 9:30 PM, Wednesday, 3/7/07
  • Questions (Subject to change)
  • True or False (Need explanation)
  • Short answer questions
  • Exercise questions (resembling homework problems)
  • Total 5 to 6 questions (multiple parts)

Midterm Review
  • Material Covered
  • Chapters 1-4 and Appendix A of textbook
  • Associated homework problems
  • Associated lecture notes

Fundamentals of Computer Design
  • Important material from Chapter 1 includes
  • Trends in computer usage
  • Trends in technology
  • Computer costs and price
  • Computer performance and execution time
  • Amdahl's law
  • Benchmarks
  • Computer performance equation
  • CPU Time = IC × CPI × Clock Cycle Time
  • Poor performance metrics (MFLOPS, MIPS)
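
As a quick study aid, the performance equation and Amdahl's law from this list can be sketched in Python. The function names and the sample numbers are illustrative assumptions, not values from the course.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time = IC x CPI x clock cycle time (result in seconds)."""
    return instruction_count * cpi * clock_cycle_time


def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical workload: 1e9 instructions, CPI of 1.5, 1 ns cycle (1 GHz).
print(cpu_time(1e9, 1.5, 1e-9))

# Hypothetical enhancement: 40% of the execution time made 10x faster.
print(amdahl_speedup(0.4, 10.0))
```

Note that the overall speedup in the second call stays well below 10 because 60% of the time is unaffected.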

Instruction Set Design
  • Important material from Chapter 2 includes
  • ISA classifications (accumulator, stack, etc.)
  • Memory addressing
  • Operations
  • Operand specifiers
  • Instruction set formats
  • Compilers and ISA
  • DLX architecture

  • Important material from Appendix A includes
  • Basic issues in pipelining (stages, throughput,
  • Multiple-cycle and pipelined DLX
  • Pipeline performance
  • Pipeline hazards (structural, data, control)
  • Coping with hazards (stalls, bypassing, branch
  • Floating point pipelines

Instruction Level Parallelism
  • Important material from Chapter 3 and Chapter 4
  • Dynamic techniques
  • Scoreboard
  • Tomasulo's approach
  • Static techniques
  • Compiler scheduling
  • Loop unrolling
  • Superscalar, VLIW approaches
  • Branch prediction algorithms

Problem A.2 Review (Solution on Pages B-35 to B-38)
  • Loop: L.D F0, 0(R2)
  • L.D F4, 0(R3)
  • MUL.D F0, F0, F4
  • ADD.D F2, F0, F2
  • DADDUI R2, R2, 8
  • DADDUI R3, R3, 8
  • DSUBU R5, R4, R2
  • BNEZ R5, Loop
  • Initial value of R4 = R2 + 792
  • Standard 5-stage integer pipeline plus MIPS FP
    pipeline
  • Structural hazards due to WB contention: earlier
    instructions get priority
  • Without any forwarding (register file forwarding
    in same cycle is ok). Branch is handled by
    flushing the pipeline
  • With forwarding and assume branch is handled by
    predicting it as not taken

Latency in cycles
  • Integer ALU: 0
  • Data memory: 1
  • FP add: 3
  • Multiply: 6
  • FP divide: 24
  • Note: Latencies are applicable when data is
    forwarded to EX stage
  • Code iterates 99 times (792/8 = 99)

No forwarding, branches flush the pipeline
  • F, D, E, M, and W denote the fetch, decode,
    execute, memory access and write-back stages
  • Stalls
  • Cycle 22: Register file can read and write in the
    same cycle, so the hazard resolves in cycle 22
  • Cycle 20: Handle uncertainty on where to fetch by
    flushing the pipeline; correct fetch in cycle 23
  • No structural hazard stalls due to write-back
  • Two instructions are in execution simultaneously
    but in different FUs, so no structural hazard
  • Total loop execution = 99 iterations × 22
    cycles/iteration = 2178 cycles
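
The arithmetic on this slide can be checked with a few lines of Python. The constant names are assumptions made for illustration; the numbers (792-byte offset, 8-byte stride, 22 cycles per iteration) are the ones given above.

```python
STRIDE_BYTES = 8                   # DADDUI advances R2 by 8 each iteration
OFFSET_BYTES = 792                 # initial distance between R4 and R2
CYCLES_PER_ITERATION_NO_FWD = 22   # from the no-forwarding pipeline diagram

# The loop exits when R2 reaches R4, so it runs offset / stride times.
iterations = OFFSET_BYTES // STRIDE_BYTES
total_cycles = iterations * CYCLES_PER_ITERATION_NO_FWD

print(iterations, total_cycles)  # 99 2178
```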

Forwarding, branch predicted not taken
  • F, D, E, M, and W denote the fetch, decode,
    execute, memory access and write-back stages
  • Cycle 16: DSUBU is stalled at ID to avoid
    contention with ADD.D for the WB stage
  • Cycle 17: Branches are predicted not taken and
    fetch is from the fall-through location
  • But for the last loop iteration, this branch is
    mispredicted. Fetch cycle is redone at cycle 19
  • Two instructions are in execution simultaneously
    but in different FUs, so no structural hazard
  • First iteration ends in cycle 19, the second
    begins in 19
  • All iterations except the last take 18 cycles.
    Last takes 19 cycles (if code follows it is 16
  • Loop execution time = 98 × 18 + 19 = 1783 cycles
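
A short Python check of the total above, plus a comparison with the 2178-cycle no-forwarding case. The helper name `loop_cycles` is hypothetical, and the speedup figure is a derived illustration rather than a number from the slides.

```python
def loop_cycles(iterations, steady_cycles, last_cycles):
    """Total cycles when every iteration but the last costs steady_cycles."""
    return (iterations - 1) * steady_cycles + last_cycles


forwarding_total = loop_cycles(99, 18, 19)   # 98 * 18 + 19
no_forwarding_total = 99 * 22                # total from the no-forwarding case

# Ratio of the two totals: how much faster the forwarding pipeline runs.
speedup = no_forwarding_total / forwarding_total

print(forwarding_total)    # 1783
print(round(speedup, 2))   # 1.22
```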

  • For this problem, we will work with the following
    SAXPY loop.
  • Do i = 1, n
  • Z[i] = A*X[i] + Y[i]
  • r1 holds the starting address of the X array, r2
    the start of the Y array, and r3 the start of the
    Z array. MAX is the address of the final X
    element that needs to be included in the loop.
  • An "f" at the end of an opcode indicates a double
    or floating-point operation.
    loop: lf    f0, 0(r1)
          multf f4, f0, f2    // floating-point multiply; f2 contains scalar A
          lf    f0, 0(r2)
          addf  f6, f4, f0    // floating-point add
          sf    f6, 0(r3)
          addi  r1, r1, 8     // integer add
          addi  r2, r2, 8
          addi  r3, r3, 8
          sgti  r4, r1, MAX   // r4 = 1 if (r1 > MAX), else r4 = 0; integer/ALU instruction
          beqz  r4, loop
  • You will be asked to compute different schedules
    for this loop, for several different machine
    configurations. For the purposes of this
    question, assume that branches are always
    predicted as taken, multf latency is 4 cycles,
    addf latency is 2 cycles, and all other
    operations execute in a single cycle. For parts a
    and b, assume no structural hazards (unless stated
    otherwise, assume every functional unit is
    available in multiple copies).
  • a. Consider a conventional DLX pipeline
    with full forwarding and no provisions for
    precise state (i.e., out-of-order writebacks are
    allowed). Draw the pipeline diagram. How many
    cycles does it take to execute a single iteration
    of this loop (i.e., from the completion of the
    first instruction in one iteration to the
    completion of the first instruction in the next
    iteration)? Remember to respect all RAW hazards,
    and label all stalls. Watch out for pipeline
    violations (i.e., two instructions in the same
    pipeline stage at the same time).
  • b. Reschedule the code to separate dependent
    instructions from one another. Draw the pipeline
    diagram for the rescheduled loop. How many
    cycles do two unrolled iterations take?
  • c. Go back to the original (pre-unrolling)
    loop. Consider a scoreboard with 6 entries. The
    machine has 5 functional units: 3 integer ALUs,
    which perform the ALU operations and the loads and
    stores, and 2 FP units, which perform the addf
    and multf functions. Fill in the instruction
    status table for the scoreboard (IS, RO, EX, and
    WB for each entry). However, rather than an X in
    each slot, write the cycle number in which the
    given instruction completes the particular
    function. Consider the cycle in which the
    instruction sgti is dispatched. Highlight the
    completed functions at that cycle. Fill in the
    state of the register status table and the
    functional unit table, as it would be at the end
    of this cycle. How many cycles does a loop
    iteration take to execute? (Again, count from the
    end of the first instruction of one iteration to
    the end of the first instruction of the next
    iteration.) Remember there is no bypassing.
  • d. Assume, in addition to the description in
    part c, there is an additional constraint: only
    one instruction can read its operands or write
    them back in any cycle, as there is only one bus
    that can be used by the functional units to
    access the register file. Assume the first
    instruction waiting for its operands gets to read
    (the availability of register info will be
    communicated via dedicated and individual control
    circuits, so there is no contention there). Also
    assume that the bus is wide enough that the one
    instruction can read as many registers as it
    wants in a given cycle.
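
For reference, the SAXPY computation that the DLX loop implements, together with a two-way unrolled form of the kind part b asks about, might look like this in Python. This is only a functional sketch: it indexes list elements instead of stepping byte addresses up to MAX, and the function names are illustrative assumptions.

```python
def saxpy(a, x, y):
    """Reference loop: z[i] = a * x[i] + y[i] for every element."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    z = [0.0] * len(x)
    for i in range(len(x)):
        z[i] = a * x[i] + y[i]
    return z


def saxpy_unrolled2(a, x, y):
    """Same computation with the loop body unrolled twice."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    z = [0.0] * n
    i = 0
    # Two elements per pass: one index update and one branch per pair.
    while i + 1 < n:
        z[i] = a * x[i] + y[i]
        z[i + 1] = a * x[i + 1] + y[i + 1]
        i += 2
    # Cleanup pass for an odd element count.
    if i < n:
        z[i] = a * x[i] + y[i]
    return z


print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Unrolling halves the number of index updates and branches per element, which is what gives the scheduler room to separate dependent instructions.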

  • DLX pipeline with full forwarding
  • Branches predicted as taken
  • Multf latency: 4 cycles
  • Addf latency: 2 cycles
  • Others: 1 cycle (executes in 1 cycle)
  • Parts a and b: multiple FUs (no structural hazard)

Part a
17 cycles are needed to execute a single
iteration (all but the last iteration) and 18
cycles are needed for the last iteration. (The
last iteration takes 19 cycles if there is no
code after the instructions of the loop).
Part b
With stalls
    lf    f0, 0(r1)
    stall
    multf f4, f0, f2
    lf    f0, 0(r2)
    stall
    stall
    stall
    addf  f6, f4, f0
    stall               // another stall comes due to WB structural hazard
    sf    f6, 0(r3)
    addi  r1, r1, 8
    addi  r2, r2, 8
    addi  r3, r3, 8
    sgti  r4, r1, MAX
    stall
    beqz  r4, loop
Original loop
    lf    f0, 0(r1)
    multf f4, f0, f2
    lf    f0, 0(r2)
    addf  f6, f4, f0
    sf    f6, 0(r3)
    addi  r1, r1, 8
    addi  r2, r2, 8
    addi  r3, r3, 8
    sgti  r4, r1, MAX
    beqz  r4, loop
Part b: rescheduled two-iteration loop
Unrolled and renamed, with stalls
    lf    f0, 0(r1)
    stall
    multf f4, f0, f2
    lf    t1, 0(r2)
    stall
    stall
    stall
    addf  f6, f4, t1
    stall
    sf    f6, 0(r3)
    addi  r1, r1, 8
    addi  r2, r2, 8
    addi  r3, r3, 8
    sgti  r4, r1, MAX
    stall
    beqz  r4, loop
    lf    t2, 8(r1)
    stall
    multf t5, t2, f2
    lf    t3, 8(r2)
    stall
    stall
    stall
    addf  t4, t5, t3
    stall
    sf    t4, 8(r3)
    addi  r1, r1, 16
    addi  r2, r2, 16
    addi  r3, r3, 16
    sgti  r4, r1, MAX
    stall
    beqz  r4, loop
Rescheduled
    lf    f0, 0(r1)
    lf    t1, 0(r2)
    multf f4, f0, f2
    lf    t2, 8(r1)
    lf    t3, 8(r2)
    multf t5, t2, f2
    addi  r1, r1, 16
    addf  f6, f4, t1
    addi  r2, r2, 16
    sf    f6, 0(r3)
    addf  t4, t5, t3
    sgti  r4, r1, MAX
    sf    t4, 8(r3)
    beqz  r4, loop
    addi  r3, r3, 16
Unrolling, renaming, rescheduling
Unrolled - scheduled
Even though we did scheduling to avoid all data
hazards, the structural hazards (limited write
and memory ports) are causing delays. 19/2 = 9.5
cycles are needed for a single iteration (2
iterations = 15 instructions + 4 stalls). Note: It
is possible to reschedule the loop further to
reduce or eliminate the stalls due to structural
hazards.
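
The per-iteration cost of the unrolled, scheduled loop follows directly from the instruction and stall counts. The sketch below redoes that arithmetic and, as a rough comparison only, sets it against the 17-cycle steady-state figure reported for Part a; the helper name is a hypothetical choice.

```python
def cycles_per_iteration(instructions, stalls, iterations_per_pass):
    """Average cycles per original loop iteration for one pass of the unrolled body."""
    return (instructions + stalls) / iterations_per_pass


unrolled = cycles_per_iteration(instructions=15, stalls=4, iterations_per_pass=2)
original = 17  # cycles per iteration reported for Part a

print(unrolled)                       # 9.5
print(round(original / unrolled, 2))  # 1.79
```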
Part C Scoreboard - Notes
Part C Scoreboard
Loop iteration takes 20 - 4 = 16 cycles
Part D Scoreboard with a single bus to access
register file
Loop iteration takes 24 - 4 = 20 cycles
Example Short Questions
  • What is forwarding or bypassing?
  • Why do you need benchmark programs?
  • What are the main differences between RISC and
  • What is Amdahl's law?
  • CPI vs. IPC: what is the relation?
  • Will pipeline approach help the instruction
  • Why Tomasulo's algorithm is better than
  • What are the advantages of compiler based
  • How do you measure the performance of a computer?

True or False questions with explanations
  • Deeper pipeline architectures yield higher
  • Floating-point-based graphics routines are better
    benchmarks than many other programs.
  • Branch prediction algorithms are evaluated based
    on their performance measured against benchmark
  • Dynamic scheduling causes more structural
  • The relative performance of two processors with
    the same instruction set architecture (ISA) can
    be judged by clock rate or by the performance of
    a single benchmark suite.
  • An architecture with flaws cannot be successful
  • Stack architecture is better than accumulator
    architecture in conserving memory bus traffic.
  • Computer with higher clock-speed is likely to
    provide better performance
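
When preparing an explanation for the clock-speed statements above, it can help to plug numbers into the CPU time equation. The sketch below uses made-up CPI and clock-rate values for two hypothetical processors with the same ISA (hence the same instruction count); it shows a case where the processor with the higher clock rate is slower.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """Execution time in seconds: IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


IC = 1_000_000_000  # same ISA and program, so the instruction count is equal

# Hypothetical processors: the 3 GHz part needs twice as many cycles per instruction.
fast_clock = execution_time(IC, cpi=2.0, clock_rate_hz=3.0e9)
slow_clock = execution_time(IC, cpi=1.0, clock_rate_hz=2.0e9)

print(fast_clock > slow_clock)  # True: the higher clock rate loses here
```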