Outline - PowerPoint PPT Presentation

Provided by: Engineerin109
Learn more at: https://www.cise.ufl.edu
Transcript and Presenter's Notes

1
Chapter 2 ILP and Its Exploitation
  • Review simple static pipeline
  • ILP Overview
  • Dynamic branch prediction
  • Dynamic scheduling, out-of-order execution
  • Multiple issue (superscalar)
  • Hardware-based speculation
  • ILP limitation
  • Intel P6 microarchitecture

2
Dynamic Scheduling
  • If an instruction is stalled, there's no need to
    stall later instructions that aren't dependent on
    any of the stalled instructions, i.e.
    out-of-order execution
  • Example:
      DIVD F0,F2,F4    ; long-running
      ADDD F10,F0,F8   ; depends on DIVD
      SUBD F12,F8,F14  ; independent of both
  • The ADDD is stalled before execution, but the
    SUBD can go ahead.
  • Out-of-order execution can introduce WAW and WAR
    hazards

3
Splitting Instruction Decode
  • Single Instruction Decode stage split into 2
    parts
      • Instruction Issue or dispatch (in-order)
          • Determine instruction type
          • Check for structural hazards
      • Read Operands (can be out-of-order)
          • Stall instruction until no data hazards
          • Read operands
          • Release instruction to begin execution
  • Need some sort of queue or buffer to hold
    instructions till their operands are ready.
  • Note: Out-of-order completion makes precise
    exception handling difficult! How to handle?

[Figure: the Instruction Decode stage split into Issue, Queue, and Read Operand]
4
Tomasulo's Algorithm
  • Tomasulo's algorithm
      • Another approach for dynamic scheduling,
        out-of-order execution
      • First used in IBM 360/91 FPU, many years ago
      • Based on key concept of dynamic register renaming
      • Like static renaming we used in loop-unroll
        example
  • Some features
      • Copes with long-latency operations (FPU or mem.)
      • Eliminates WAW and WAR hazards without stalling
      • Instructions begin execution as soon as their
        operands are ready; results are forwarded
        directly, bypassing the register file
      • Distributed hazard detection and execution control
5
Tomasulo's Algorithm
  • Key differences (from Scoreboarding)
      • Hazard detection and instruction issue control
        are done per execution unit
      • Data results go straight to where they are
        needed, via the Common Data Bus (CDB)
      • Loads/stores get their own execution units
      • Uses reservation stations for register renaming

[Figure: Instruction Fetch, Instruction Queue, Issue Logic / Control Unit, Register File, and two reservation stations feeding Execution unit 1 and Execution unit 2, all connected by the Common Data Bus (CDB)]

6
Components of a Tomasulo Unit
  • Reservation stations (RSs)
      • Buffer pending instructions and their operands
        while they wait to enter the execution units.
  • Issue logic
      • Redirects (renames) instructions' register
        outputs to reservation-station slots.
      • Results go directly to RSs rather than through
        the register file.
  • Distributed hazard detection
      • Handled separately by each functional unit
  • Load and store buffers (can be combined with RS)
      • Queue up memory access requests

7
Simple FPU using Tomasulo's Algorithm
8
Major Steps in Tomasulo (Fig 2.12)
  • Issue
      • Get instruction from FP instruction queue
      • If a slot in the appropriate RS (or load-store
        buffer) is available, send the instruction there;
        else stall it (structural hazard).
      • Send operand values to the RS if already
        available; otherwise, just note the names of the
        RSs that will produce the operands
  • Execute
      • While operands are not yet available, monitor
        the CDB for them.
      • When all operands are in the RS, begin executing
        the instruction.
  • Write result
      • When the result is available and the CDB is
        free, write the result to the CDB, and from there
        to the registers and the RS/store slots waiting
        for it.
      • Update register status, RS values, flags, busy
        state, etc.
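
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware in Fig 2.12: it ignores cycle timing and latencies, assumes a single CDB, and every name in it (`regs`, `reg_status`, `stations`, `issue`, `can_execute`, `write_result`) and every register value is hypothetical.

```python
# Minimal sketch of Issue / Execute / Write result. All names are hypothetical.
regs = {"F0": 0.0, "F2": 3.0, "F4": 2.0, "F6": 0.0}   # assumed register values
reg_status = {}                                        # register -> producing RS tag
stations = {"Mult1": None, "Add1": None}               # tag -> entry, None if free
OPS = {"MULTD": lambda a, b: a * b, "ADDD": lambda a, b: a + b}

def issue(tag, op, dest, src1, src2):
    """Issue: stall on a busy station, else copy values or record producer tags."""
    if stations[tag] is not None:
        return False                          # structural hazard: stall
    entry = {"op": op, "V": [None, None], "Q": [None, None]}
    for i, src in enumerate((src1, src2)):
        if src in reg_status:
            entry["Q"][i] = reg_status[src]   # operand is still being computed
        else:
            entry["V"][i] = regs[src]         # operand is available now
    stations[tag] = entry
    reg_status[dest] = tag                    # rename: dest now comes from this RS
    return True

def can_execute(tag):
    """Execute may begin once no operand is waiting on a producer."""
    entry = stations[tag]
    return entry is not None and entry["Q"] == [None, None]

def write_result(tag):
    """Write result: compute, broadcast on the CDB, then free the station."""
    entry = stations[tag]
    value = OPS[entry["op"]](*entry["V"])
    for other in stations.values():           # waiting stations snoop the CDB
        if other is None:
            continue
        for i in range(2):
            if other["Q"][i] == tag:
                other["V"][i], other["Q"][i] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:                   # only the latest writer updates reg
            regs[reg] = value
            del reg_status[reg]
    stations[tag] = None
    return value

issue("Mult1", "MULTD", "F0", "F2", "F4")     # F0 = F2 * F4
issue("Add1", "ADDD", "F6", "F0", "F2")       # F6 = F0 + F2, waits on Mult1
```

After these two issues, `Add1` holds the tag `Mult1` in place of F0, so it cannot execute until `write_result("Mult1")` broadcasts the product.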

9
Example for Tomasulo's Algorithm
  • We will go through the same code fragment to see
    how Tomasulo's algorithm handles out-of-order
    execution.
      1. LD    F6,34(R2)
      2. LD    F2,45(R3)
      3. MULTD F0,F2,F4
      4. SUBD  F8,F6,F2
      5. DIVD  F10,F0,F6
      6. ADDD  F6,F8,F2

[Figure legend: data dependence, anti-dependence, output dependence]
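
The three kinds of dependence in this fragment can be enumerated mechanically. The sketch below is a simplified pairwise check, assuming each instruction is described by one destination and a list of sources; the tuple layout and the `find_dependences` helper are hypothetical, and the check does not account for intervening writes to the same register (none matter in this fragment).

```python
# Each instruction: (opcode, destination register, source registers).
PROGRAM = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]

def find_dependences(program):
    """Return (kind, earlier, later, register) tuples, numbered from 1."""
    found = []
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i in srcs_j:      # later reads what earlier writes
                found.append(("RAW", i + 1, j + 1, dest_i))
            if dest_j in srcs_i:      # later writes what earlier reads
                found.append(("WAR", i + 1, j + 1, dest_j))
            if dest_i == dest_j:      # both write the same register
                found.append(("WAW", i + 1, j + 1, dest_i))
    return found

deps = find_dependences(PROGRAM)
for kind, earlier, later, reg in deps:
    print(f"{kind}: instruction {later} on instruction {earlier} via {reg}")
```

RAW entries are the true data dependences; the WAR and WAW entries on F6 are the name dependences that renaming removes.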
10
Reservation Station Fields
  • In each slot
      • Op - The operation to perform on operands S1
        and S2
      • Qj, Qk - The RS slots that will produce S1, S2
      • Vj, Vk - The values of S1 and S2
      • Busy - The RS and its execution unit are occupied
  • In register file entries and store buffer slots
      • Qi - The RS slot containing the op whose result
        should be stored here
  • In load and store buffers (combined in RS)
      • A - Holds the effective address for a load or
        store
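
As a rough illustration, the fields listed above map onto a small record type. The class name, the `ready` helper, and the operand values below are assumptions made for this sketch; only the field names come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                     # slot name, e.g. "Mult1"
    busy: bool = False            # Busy: slot and its execution unit occupied
    op: Optional[str] = None      # Op: operation to perform
    vj: Optional[float] = None    # Vj: value of first operand, if known
    vk: Optional[float] = None    # Vk: value of second operand, if known
    qj: Optional[str] = None      # Qj: RS slot that will produce operand 1
    qk: Optional[str] = None      # Qk: RS slot that will produce operand 2
    a: Optional[int] = None       # A: effective address (loads/stores only)

    def ready(self) -> bool:
        """An occupied slot may execute once neither operand awaits a producer."""
        return self.busy and self.qj is None and self.qk is None

# Register status table: Qi for each register with a pending writer.
reg_status = {"F0": "Mult1"}

# MULTD F0,F2,F4 with F2 still coming from Load2 and F4 already read
# (the value 2.5 is made up for the example).
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk=2.5, qj="Load2")
waiting_before = mult1.ready()

# Load2 delivers F2: the tag is replaced by the value.
mult1.vj, mult1.qj = 4.0, None
ready_after = mult1.ready()
```

An empty Q field (here `None`) is what distinguishes "value present in V" from "still waiting on a producer".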

11
Tomasulo Example
12
Cycle 1
13
Cycle 2
Note: Multiple loads can be outstanding
14
Cycle 3
  • Note: register names are removed (renamed) in the
    reservation stations; MULT issued
  • Load1 completing; what is waiting for Load1?

15
Cycle 4
  • Load2 completing; what is waiting for Load2?

16
Cycle 5
  • Timers start counting down for Add1 and Mult1

17
Cycle 6
  • Issue ADDD here despite name dependency on F6?

18
Cycle 7
  • Add1 (SUBD) completing; what is waiting for it?

19
Cycle 8
20
Cycle 9
21
Cycle 10
  • Add2 (ADDD) completing; what is waiting for it?

22
Cycle 11
  • Write result of ADDD here?
  • All quick instructions complete in this cycle!

23
Cycle 12
24
Cycle 13
25
Cycle 14
26
Cycle 15
  • Mult1 (MULTD) completing; what is waiting for it?

27
Cycle 16
  • Just waiting for Mult2 (DIVD) to complete

28
Cycle 55 (after skipping cycles)
29
Cycle 56
  • Mult2 (DIVD) is completing; what is waiting for
    it?

30
Cycle 57
  • Once again: in-order issue, out-of-order
    execution, and out-of-order completion.

31
Tomasulo's Two Major Advantages
  • Distribution of the hazard detection logic
      • Distributed reservation stations and the CDB
      • If multiple instructions are waiting on a single
        result, and each already has its other operand,
        they can all be released simultaneously by the
        broadcast on the CDB
      • If a centralized register file were used, the
        units would have to read their results from the
        registers when register buses are available
  • Elimination of stalls for WAW and WAR hazards
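
The simultaneous release described under the first advantage can be sketched as a single pass over the waiting slots. The snapshot below is hypothetical (it is not a cycle from the worked example), and the dictionary layout and `broadcast` helper are assumptions for illustration.

```python
# Hypothetical snapshot: three slots all waiting on the result tagged "Load2".
waiting = [
    {"name": "Mult1", "V": [None, 2.0],  "Q": ["Load2", None]},
    {"name": "Add1",  "V": [5.0, None],  "Q": [None, "Load2"]},
    {"name": "Add2",  "V": [None, None], "Q": ["Add1", "Load2"]},
]

def broadcast(stations, tag, value):
    """Deliver one CDB result to every operand field waiting on `tag`.

    Returns the names of the stations that became ready as a result.
    """
    released = []
    for rs in stations:
        hit = False
        for i in range(2):
            if rs["Q"][i] == tag:
                rs["V"][i], rs["Q"][i] = value, None
                hit = True
        if hit and rs["Q"] == [None, None]:
            released.append(rs["name"])
    return released

released = broadcast(waiting, "Load2", 7.0)
```

Mult1 and Add1 already held their other operand, so one broadcast releases both; Add2 captures the value too but keeps waiting on Add1.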

32
Elimination of WAR Hazards
  • Note the potential WAR hazard between DIVD and
    ADDD involving F6.
  • But, as soon as DIVD enters the RS, it becomes
    independent of the ADDD!
  • The 2nd source operand no longer refers to F6,
    but stores the value of F6 produced earlier by
    the LD.
  • If the LD had not yet completed, the 2nd operand
    would then refer to the LD's RS, but still not to
    F6!
  • So, ADDD can write its new value for F6 before
    DIVD executes, without messing it up!
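
The argument above can be traced in a small sketch. The register values, the `read_operand` helper, and the value broadcast by Mult1 are all made up for illustration; the point is only that DIVD holds a copy of F6 (or a tag), never the name F6.

```python
regs = {"F2": 3.0, "F6": 1.5, "F8": 4.0}   # hypothetical register values
reg_status = {"F0": "Mult1"}               # F0 is still being computed by Mult1

def read_operand(reg):
    """Return (value, tag): a copied value, or the tag of the producing RS."""
    if reg in reg_status:
        return None, reg_status[reg]
    return regs[reg], None

# DIVD F10,F0,F6 is issued: F0 becomes a tag, F6 is copied by value.
vj, qj = read_operand("F0")
vk, qk = read_operand("F6")

# ADDD F6,F8,F2 finishes first and overwrites F6 in the register file.
regs["F6"] = regs["F8"] + regs["F2"]

# Later, Mult1 broadcasts F0 (assumed to be 6.0) and DIVD can execute.
if qj == "Mult1":
    vj, qj = 6.0, None
divd_result = vj / vk    # uses the old F6 captured at issue
```

DIVD divides by the F6 value captured at issue, so the early write by ADDD does no harm.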

33
Elimination of WAW Hazards
  • Note the potential WAW hazard between the first LD
    and the last ADDD involving F6.
  • But, as soon as the ADDD is issued, the register
    status table is updated with F6 assigned to
    Add2
  • So if the LD completed after that point, it would
    not update F6, thus eliminating the WAW hazard
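
A minimal sketch of the register status check, assuming the load were slow enough to still be outstanding when the ADDD issues. The helper names and the result values are hypothetical.

```python
regs = {"F6": 0.0}
reg_status = {}          # register -> tag of the RS that will write it

def issue_dest(reg, tag):
    """At issue, the newest writer replaces any older tag for `reg`."""
    reg_status[reg] = tag

def write_back(tag, value):
    """Write only to registers whose status still names `tag`; return them."""
    updated = [r for r, t in reg_status.items() if t == tag]
    for r in updated:
        regs[r] = value
        del reg_status[r]
    return updated

issue_dest("F6", "Load1")               # LD F6,34(R2)
issue_dest("F6", "Add2")                # ADDD F6,F8,F2 issues before the load returns
late_load = write_back("Load1", 10.0)   # stale result: no register is updated
final_add = write_back("Add2", 7.0)     # the last writer in program order wins
```

The load's value still reaches any reservation station that captured the tag `Load1`; it is only the register file write that is suppressed.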

34
Tomasulo Drawbacks
  • Complexity
      • delays of 360/91, MIPS 10000, Alpha 21264, IBM
        PPC 620 in CA:AQA 2/e, but not in silicon!
      • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
      • Each CDB must go to multiple functional units
        → high capacitance, high wiring density
      • Number of functional units that can complete per
        cycle limited to one!
      • Multiple CDBs → more FU logic for parallel
        associative stores
  • Non-precise interrupts!
      • This will be addressed later

35
Overlapping Loop Iterations
  • Register renaming
      • Multiple iterations use different physical
        destinations for registers (dynamic loop
        unrolling).
  • Reservation stations
      • Permit instruction issue to advance past integer
        control flow operations
      • Also buffer old values of registers, totally
        avoiding the WAR stall
  • Another perspective: Tomasulo builds the data flow
    dependency graph on the fly
  • Note: branch prediction is still needed!

36
Dynamic Loop Scheduling
  • Loop example
      Loop: LD    F0,0(R1)
            MULTD F4,F0,F2
            SD    0(R1),F4
            SUBI  R1,R1,8
            BNEZ  R1,Loop
  • Note: data dependences can span loop iterations.
  • But, using Tomasulo with predict-taken, multiple
    iterations can issue and begin execution
    simultaneously!
  • Like dynamic loop unrolling by the HW.
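
The renaming that makes this work can be sketched as follows. The sketch assumes the branch is predicted taken and that free load buffers and multiplier slots are always available; the tag generators and `issue_iteration` are hypothetical names, and the SD, SUBI, and BNEZ are left out.

```python
from itertools import count

reg_status = {}                                    # register -> producing tag
load_tags = (f"Load{i}" for i in count(1))         # assumed unlimited buffers
mult_tags = (f"Mult{i}" for i in count(1))         # assumed unlimited slots

def issue_iteration():
    """Issue LD F0 and MULTD F4,F0,F2 for one iteration.

    Returns (tag of the MULTD, tag the MULTD captured for F0).
    """
    reg_status["F0"] = next(load_tags)    # LD F0,0(R1) renames F0
    f0_source = reg_status["F0"]          # MULTD reads F0 as this load's tag
    mul = next(mult_tags)
    reg_status["F4"] = mul                # MULTD F4,F0,F2 renames F4
    return mul, f0_source

first = issue_iteration()
second = issue_iteration()
```

Each iteration's MULTD is tied to its own load through a distinct tag, so the second iteration does not wait for the first to finish with F0 or F4.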

37
Check Figure 2.13