Title: Outline
1Chapter 2 ILP and Its Exploitation
- Review simple static pipeline
- ILP Overview
- Dynamic branch prediction
- Dynamic scheduling, out-of-order execution
- Multiple issue (superscalar)
- Hardware-based speculation
- ILP limitation
- Intel P6 microarchitecture
2Dynamic Scheduling
- If an instruction is stalled, there's no need to
stall later instructions that aren't dependent on
any of the stalled instructions, i.e.
out-of-order execution
- Example
DIVD F0,F2,F4 (long-running)
ADDD F10,F0,F8 (depends on DIVD)
SUBD F12,F8,F14 (independent of both)
- The ADDD is stalled before execution, but the
SUBD can go ahead.
- Out-of-order execution can encounter WAW and WAR hazards
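The independence check in the example above can be sketched in Python. This is a minimal illustration, not part of the original slides: the tuple encoding `(opcode, dest, src1, src2)` and the helper name `stalled_set` are assumptions, and only true (RAW) dependences are tracked.

```python
# Hypothetical encoding: (opcode, destination, source1, source2)
program = [
    ("DIVD", "F0", "F2", "F4"),    # long-running
    ("ADDD", "F10", "F0", "F8"),   # reads F0, so it depends on DIVD
    ("SUBD", "F12", "F8", "F14"),  # independent of both
]

def stalled_set(program, stalled_index):
    """Return indices of instructions that must wait for the stalled one.

    An instruction waits if it reads a register written by the stalled
    instruction or by a later instruction that is itself waiting (RAW only).
    """
    waiting = {stalled_index}
    pending_dests = {program[stalled_index][1]}
    for i in range(stalled_index + 1, len(program)):
        _, dest, src1, src2 = program[i]
        if src1 in pending_dests or src2 in pending_dests:
            waiting.add(i)
            pending_dests.add(dest)
        else:
            # An independent writer supplies a fresh value for dest
            pending_dests.discard(dest)
    return waiting

waiting = stalled_set(program, 0)
free_to_run = [program[i][0] for i in range(len(program)) if i not in waiting]
print(free_to_run)
```

Only SUBD ends up in `free_to_run`, matching the slide: the ADDD waits on F0 while the SUBD can go ahead.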
3Splitting Instruction Decode
- Single Instruction Decode stage split into 2
parts
- Instruction Issue or dispatch (in-order)
- Determine instruction type
- Check for structural hazards
- Read Operands (can be out-of-order)
- Stall instruction until no data hazards
- Read operands
- Release instruction to begin execution
- Need some sort of queue or buffer to hold
instructions till their operands are ready.
- Note: Out-of-order completion makes precise
exception handling difficult! How to handle?
[Figure: the Instruction Decode stage split into Issue, Queue, and Read Operand]
4Tomasulo's Algorithm
- Tomasulo's algorithm
- Another approach for dynamic scheduling,
out-of-order execution
- First used in IBM 360/91 FPU, many years ago
- Based on key concept of dynamic register renaming
- Like static renaming we used in loop-unroll
example
- Some features
- Copes with long-latency operations (FPU or mem.)
- Eliminates WAW and WAR hazards without stalling
- Instructions begin execution as soon as their operands are
ready, with direct forwarding that bypasses the register file
- Distributed hazard detection and execution control
5Tomasulo's Algorithm
- Key differences (from Scoreboarding)
- Hazard detection & inst. issue are done per
execution unit
- Data results go straight to where they are
needed, via the CDB
- Loads/stores get their own execution units
- Use Reservation Stations for register renaming
[Figure: block diagram with Instruction Fetch, Instruction Queue, Issue Logic / Control Unit, Register File, a Reservation Station for each of Execution units 1 and 2, and the Common Data Bus (CDB)]
6Components of a Tomasulo Unit
- Reservation stations (RSs)
- Buffer pending instructions and their operands while
they wait for the remaining operands before entering the
execution units.
- Issue logic
- Redirects (renames) instructions' register
outputs to reservation-station slots.
- Results go directly to RSs rather than through the reg.
file.
- Distributed hazard detection
- Handled separately by each functional unit
- Load & store buffers (can be combined with RS)
- Queue up memory access requests
7Simple FPU using Tomasulo's Algorithm
8Major Steps in Tomasulo (Fig 2.12)
- Issue
- Get instruction from FP instruction queue
- If a slot in appropriate RS (or load-store
buffer) is available, send instruction there;
else stall it (structural hazard).
- Send operand values to RS if already available;
otherwise, just note the names of the RSs that
will produce the operands
- Execute
- While operands not yet available, monitor CDB for
them.
- When all operands are in RS, begin executing
instruction.
- Write result
- When result is available & CDB is free, write result
to CDB, then to registers & RS/store slots for
receiving instructions.
- Update register status, RSs value, flag, busy
state, etc.
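The three steps above can be sketched as a toy Python model. This is an illustration under stated assumptions rather than the hardware in Fig 2.12: there is no timing, only one Mult and one Add station exist, and the names `Station`, `issue`, `ready`, and `write_result` as well as the register values are made up for the example. The `vj/vk` and `qj/qk` fields follow the Vj/Vk and Qj/Qk naming used for reservation stations later in the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # names of stations that will produce them
    qk: Optional[str] = None

regs = {"F0": 0.0, "F2": 6.0, "F4": 3.0, "F8": 1.0, "F10": 0.0}
reg_status = {}                  # register -> station that will write it
stations = [Station("Mult1"), Station("Add1")]
OPS = {"DIVD": lambda a, b: a / b, "ADDD": lambda a, b: a + b}
UNIT = {"DIVD": "Mult", "ADDD": "Add"}

def issue(op, dest, s1, s2):
    """Issue: claim a free station, or return None (structural hazard)."""
    for rs in stations:
        if rs.name.startswith(UNIT[op]) and not rs.busy:
            rs.busy, rs.op = True, op
            # Take the value if available, otherwise note the producer's name
            rs.qj, rs.vj = (reg_status[s1], None) if s1 in reg_status else (None, regs[s1])
            rs.qk, rs.vk = (reg_status[s2], None) if s2 in reg_status else (None, regs[s2])
            reg_status[dest] = rs.name
            return rs
    return None

def ready(rs):
    """Execute may begin once both operands are in the station."""
    return rs.busy and rs.qj is None and rs.qk is None

def write_result(rs):
    """Write result: broadcast (station name, value) on the CDB."""
    value = OPS[rs.op](rs.vj, rs.vk)
    for other in stations:               # waiting stations snoop the CDB
        if other.qj == rs.name:
            other.vj, other.qj = value, None
        if other.qk == rs.name:
            other.vk, other.qk = value, None
    for reg, tag in list(reg_status.items()):
        if tag == rs.name:               # registers still waiting on this tag
            regs[reg] = value
            del reg_status[reg]
    rs.busy = False
    return value

div = issue("DIVD", "F0", "F2", "F4")       # both operands available
add = issue("ADDD", "F10", "F0", "F8")      # F0 pending, so qj = "Mult1"
stalled = issue("ADDD", "F12", "F8", "F8")  # Add1 busy: structural hazard
add_ready_before = ready(add)
div_value = write_result(div)               # broadcast wakes up the ADDD
add_ready_after = ready(add)
add_value = write_result(add)
```

In this run the ADDD is issued with `qj = "Mult1"`, becomes ready only after the DIVD broadcasts on the CDB, and the second ADDD cannot issue because the single Add station is occupied.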
9Example for Tomasulo's Algorithm
- We will go through the same code fragment to see
how Tomasulo's Algorithm handles out-of-order
Exec.
- 1. LD F6,34(R2)
- 2. LD F2,45(R3)
- 3. MULTD F0,F2,F4
- 4. SUBD F8,F6,F2
- 5. DIVD F10,F0,F6
- 6. ADDD F6,F8,F2
[Figure legend: arrows mark Data Dependence, Anti-Dependence, and Output Dependence]
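The three kinds of dependence in this fragment can be enumerated with a short Python sketch. The encoding and the function name `classify` are assumptions for illustration. The check is a naive pairwise comparison that ignores intervening redefinitions, which is sufficient for this six-instruction fragment.

```python
# (opcode, destination register, source registers); the base registers
# R2/R3 are listed as sources of the loads.
program = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]

def classify(program):
    """Return sets of (earlier, later, register) triples, numbered from 1."""
    raw, war, waw = set(), set(), set()
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i in srcs_j:      # later reads what earlier writes
                raw.add((i + 1, j + 1, dest_i))
            if dest_j in srcs_i:      # later writes what earlier reads
                war.add((i + 1, j + 1, dest_j))
            if dest_i == dest_j:      # both write the same register
                waw.add((i + 1, j + 1, dest_i))
    return raw, war, waw

raw, war, waw = classify(program)
print("Data (RAW):  ", sorted(raw))
print("Anti (WAR):  ", sorted(war))
print("Output (WAW):", sorted(waw))
```

The anti-dependences and the output dependence all involve F6 and the final ADDD, which is why that register is the focus of the WAR and WAW discussion later in the slides.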
10Reservation Station Fields
- In each slot
- Op - The operation to perform on operands S1 & S2
- Qj, Qk - The RS slots that will produce S1, S2
- Vj, Vk - The values of S1 & S2
- Busy - RS & its execution unit are occupied
- In register file entries & store buffer slots
- Qi - The RS slot containing the op whose result
should be stored here.
- In load and store buffers (combined in RS)
- A - Holds the effective address for load and store.
11Tomasulo Example
12Cycle 1
13Cycle 2
- Note: Can have multiple loads outstanding
14Cycle 3
- Note: register names are removed (renamed) in
Reservation Stations; MULT issued
- Load1 completing; what is waiting for Load1?
15Cycle 4
- Load2 completing; what is waiting for Load2?
16Cycle 5
- Timer starts counting down for Add1, Mult1
17Cycle 6
- Issue ADDD here despite name dependency on F6?
18Cycle 7
- Add1 (SUBD) completing; what is waiting for it?
19Cycle 8
20Cycle 9
21Cycle 10
- Add2 (ADDD) completing; what is waiting for it?
22Cycle 11
- Write result of ADDD here?
- All quick instructions complete in this cycle!
23Cycle 12
24Cycle 13
25Cycle 14
26Cycle 15
- Mult1 (MULTD) completing; what is waiting for it?
27Cycle 16
- Just waiting for Mult2 (DIVD) to complete
28Cycle 55 (after skip cycles)
29Cycle 56
- Mult2 (DIVD) is completing; what is waiting for
it?
30Cycle 57
- Once again: in-order issue, out-of-order
execution, and out-of-order completion.
31Tomasulo's Two Major Advantages
- Distribution of the hazard detection logic
- distributed reservation stations and the CDB
- If multiple instructions are waiting on a single
result, and each already has its other operand,
then the instructions can be released simultaneously
by broadcast on the CDB
- If a centralized register file were used, the
units would have to read their results from the
registers when register buses are available
- Elimination of stalls for WAW and WAR hazards
32Elimination of WAR Hazards
- Note the potential WAR hazard between DIVD and
ADDD involving F6.
- But, as soon as DIVD enters the RS, it becomes
independent of the ADDD!
- The 2nd source operand no longer refers to F6,
but stores the value of F6 produced earlier by
the LD.
- If the LD had not yet completed, the 2nd operand
would then refer to its RS, but still not to
F6!
- So, ADDD can write its new value for F6 before
DIVD executes, without messing it up!
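The operand capture described above can be sketched in Python. This is a simplified illustration, not the full example: the register values are invented, F0 is assumed to be already available so that only the F6 operand matters, and the helper name `issue` and the dictionary layout are hypothetical.

```python
def issue(station, dest, s1, s2, regs, reg_status):
    """Capture each operand's value, or its producer's name, at issue time."""
    entry = {"name": station}
    for field, src in (("j", s1), ("k", s2)):
        if src in reg_status:            # value still being produced
            entry["Q" + field], entry["V" + field] = reg_status[src], None
        else:                            # value copied out of the register
            entry["Q" + field], entry["V" + field] = None, regs[src]
    reg_status[dest] = station
    return entry

# Case 1: the LD has already written F6 = 2.0
regs = {"F0": 8.0, "F2": 4.0, "F6": 2.0, "F8": 1.0}
reg_status = {}
divd = issue("Mult2", "F10", "F0", "F6", regs, reg_status)  # DIVD F10,F0,F6
addd = issue("Add2", "F6", "F8", "F2", regs, reg_status)    # ADDD F6,F8,F2

# ADDD finishes first and overwrites F6 in the register file.
regs["F6"] = addd["Vj"] + addd["Vk"]
del reg_status["F6"]

# DIVD still divides by the F6 value it captured at issue.
quotient = divd["Vj"] / divd["Vk"]

# Case 2: the LD is still pending, so DIVD records its station instead
pending = issue("Mult2", "F10", "F0", "F6",
                {"F0": 8.0, "F6": 0.0}, {"F6": "Load1"})
```

In both cases the DIVD entry never refers to the name F6 after issue: it holds either the old value (`Vk`) or the producing station (`Qk`), so the later write by ADDD cannot disturb it.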
33Elimination of WAW Hazards
- Note the potential WAW hazard between the first LD
and the last ADDD involving F6.
- But, as soon as ADDD is issued, the register
status table is updated with F6 assigned to
Add2
- So, the LD when it completes will not update F6, thus
eliminating the WAW hazard
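The register status check above can be sketched in Python. The scenario is hypothetical: it assumes the LD completes after the ADDD has issued, which is the ordering in which the WAW hazard would matter, and the values and the helper names `issue_writer` and `cdb_write` are invented for the illustration.

```python
regs = {"F6": 0.0}
reg_status = {"F6": "Load1"}   # first LD issued: F6 waits on Load1

def issue_writer(station, dest):
    """Issuing a new writer re-points the register at the newest producer."""
    reg_status[dest] = station

def cdb_write(tag, value):
    """A register accepts a broadcast only if it still waits on that tag."""
    updated = []
    for reg, waiting_on in list(reg_status.items()):
        if waiting_on == tag:
            regs[reg] = value
            del reg_status[reg]
            updated.append(reg)
    return updated

issue_writer("Add2", "F6")              # ADDD F6,F8,F2 issues
ld_updates = cdb_write("Load1", 7.0)    # LD completes late: F6 ignores it
f6_after_ld = regs["F6"]
add_updates = cdb_write("Add2", 3.5)    # ADDD completes: F6 takes its value
```

Because F6 is tagged with Add2 when the LD broadcasts, the register file drops the LD result and F6 ends up holding the ADDD value, which is the program-order outcome.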
34Tomasulo Drawbacks
- Complexity
- delays of 360/91, MIPS 10000, Alpha 21264, IBM
PPC 620 in CAAQA 2/e, but not in silicon!
- Many associative stores (CDB) at high speed
- Performance limited by Common Data Bus
- Each CDB must go to multiple functional units
-> high capacitance, high wiring density
- Number of functional units that can complete per
cycle limited to one!
- Multiple CDBs -> more FU logic for parallel assoc.
stores
- Non-precise interrupts!
- this will be addressed later
35Overlap Loop Interactions
- Register renaming
- Multiple iterations use different physical
destinations for registers (dynamic loop
unrolling).
- Reservation stations
- Permit instruction issue to advance past integer
control flow operations
- Also buffer old values of registers, totally
avoiding the WAR stall
- Other perspective: Tomasulo builds the data flow
dependency graph on the fly
- Note: branch prediction is still needed!
36Dynamic Loop Scheduling
- Loop example
- Loop: LD F0,0(R1)
- MULTD F4,F0,F2
- SD 0(R1),F4
- SUBI R1,R1,8
- BNEZ R1,Loop
- Note: data dependences can span loop iterations.
- But, using Tomasulo with predict-taken, multiple
iterations can issue and begin execution
simultaneously!
- Like dynamic loop unrolling by the HW.
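The "dynamic loop unrolling" effect can be sketched in Python by issuing two iterations back to back and recording which station each operand is renamed to. This is a toy model with assumptions stated up front: a free station is always available, the branch is predicted taken, the integer SUBI/BNEZ are omitted, and the names `alloc`, `source`, and `issue_iteration` are hypothetical.

```python
from itertools import count

reg_status = {}   # register -> station that will produce it
counters = {"Load": count(1), "Mult": count(1), "Store": count(1)}

def alloc(kind):
    # Assumes a free station always exists (no structural stalls)
    return f"{kind}{next(counters[kind])}"

def source(reg):
    """Producer station if the register is pending, else the register itself."""
    return reg_status.get(reg, reg)

def issue_iteration():
    """Issue LD, MULTD, SD for one iteration; return the renamed operands."""
    ld = alloc("Load")
    reg_status["F0"] = ld                     # LD F0,0(R1)
    mul = alloc("Mult")
    mul_srcs = (source("F0"), source("F2"))   # MULTD F4,F0,F2
    reg_status["F4"] = mul
    st = alloc("Store")
    st_src = source("F4")                     # SD 0(R1),F4
    return {ld: (), mul: mul_srcs, st: (st_src,)}

first = issue_iteration()    # predict taken: next iteration issues at once
second = issue_iteration()
print(first)
print(second)
```

The second iteration's MULTD waits on Load2 rather than on the name F0, so it is independent of the first iteration's Load1 and Mult1. Each iteration gets its own "physical" destinations, as if the loop had been unrolled with renamed registers.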
37Check Figure 2.13