Title: Lecture 9 Dynamic Scheduling of Pipeline
2 Static vs Dynamic Scheduling
- Static scheduling by the compiler
  - Code motion to fill load (LD) delay slots and branch delay slots
  - Code motion to avoid data dependencies
  - In-order instruction issue
    - If an instruction is stalled, no later instruction can proceed.
    - Multiple copies of a functional unit may sit idle, which is inefficient.
- Dynamic scheduling by hardware
  - Allows out-of-order execution and out-of-order completion
  - Even when an instruction is stalled, later instructions that have no data dependence on the stalled instruction, or on the instruction causing the stall, can proceed
  - Makes efficient use of multiple functional units
3 HW Schemes: Instruction Parallelism
- Why schedule in hardware at run time?
  - Works when dependencies are unknown at compile time
  - Simpler compiler
  - Code compiled for one machine runs well on another
- Key idea: allow instructions behind a stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F8,F8,F14
  - In DLX, SUBD cannot execute, even if a separate adder is available, because DLX maintains in-order execution.
- Enables out-of-order execution => out-of-order completion
- The DLX ID stage checked for both structural hazards and data dependencies
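The key idea can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the `Instr` tuple and the helper `can_proceed_behind_stall` are hypothetical names, and the check covers RAW dependences only (WAR and WAW hazards are ignored here).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


# The three-instruction sequence from the slide.
PROGRAM = [
    Instr("DIVD", "F0", "F2", "F4"),
    Instr("ADDD", "F10", "F0", "F8"),
    Instr("SUBD", "F8", "F8", "F14"),
]


def can_proceed_behind_stall(program, stalled_index):
    """Return the ops after `stalled_index` that have no RAW dependence,
    direct or transitive, on the stalled instruction."""
    blocked = {stalled_index}
    free = []
    for i in range(stalled_index + 1, len(program)):
        later = program[i]
        if any(program[j].dest in (later.src1, later.src2) for j in blocked):
            blocked.add(i)          # must wait for a blocked producer
        else:
            free.append(later.op)   # independent of the stall
    return free


# Treat the long-running DIVD (index 0) as the instruction holding things up.
print(can_proceed_behind_stall(PROGRAM, 0))
```

ADDD reads F0 and therefore waits for DIVD, while SUBD reads only F8 and F14 and is reported as free to proceed.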
4 HW Schemes: Instruction Parallelism
- Out-of-order execution divides the ID stage into two:
  - 1. Issue: decode instructions, check for structural hazards
  - 2. Read operands: wait until no data hazards, then read operands
- Scoreboards (Control Data Corp. CDC 6600) allow an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions
- Centralized implementation of hazard detection and resolution
  - Every instruction goes through the scoreboard
  - The scoreboard determines when an instruction can read its operands and begin execution
  - It monitors every change in the hardware to determine when each instruction can execute
5 Scoreboard Implications
- Out-of-order completion => WAR, WAW hazards?
    WAR                 WAW
    ADDD R1,R2,R3       ADDD R1,R2,R3
    LD   R2,X           LD   R1,X
- Solutions for WAR
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
- For WAW: stall until the other instruction completes
- Need multiple instructions in the execution phase => multiple execution units or pipelined execution units (superpipeline)
- Scoreboard keeps track of dependencies and the state of operations
- Scoreboard replaces ID, EX, WB with 4 stages
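The two instruction pairs above can be classified mechanically. The sketch below is an assumption-laden illustration: `classify_hazards` is a hypothetical helper, each instruction is reduced to a (destination, sources) pair, and the memory operand X is treated as not being a register.

```python
def classify_hazards(first, second):
    """Classify hazards between two instructions in program order.

    Each instruction is a (dest, sources) pair; `first` precedes `second`.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 in srcs2:
        found.add("RAW")   # second reads what first writes
    if dest2 in srcs1:
        found.add("WAR")   # second overwrites what first still has to read
    if dest1 == dest2:
        found.add("WAW")   # both write the same register
    return found


addd = ("R1", ("R2", "R3"))                      # ADDD R1,R2,R3
war_pair = classify_hazards(addd, ("R2", ()))    # LD R2,X
waw_pair = classify_hazards(addd, ("R1", ()))    # LD R1,X
print(war_pair, waw_pair)
```

With in-order completion neither pair is a problem; the hazards only matter once the LD is allowed to finish before the ADDD has read its operands or written its result.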
6 Four Stages of Scoreboard Control: 1st Stage (ID1), Issue
- Decode instructions and check for structural hazards
- If the functional unit for the instruction is free (no structural hazard) and no other active instruction has the same destination register (no WAW hazard):
  - The scoreboard issues the instruction to the functional unit
  - It updates its internal data structure
- If a structural hazard or WAW hazard exists:
  - Stall instruction issue
  - No further instructions issue until the hazards are cleared
  - The IF/ID1 buffer allows further instruction fetch (IF)
7 Four Stages of Scoreboard Control: 2nd Stage (ID2), Read Operands
- Wait until no data hazard, then read operands
- To prevent RAW hazards, a source operand is available for reading if
  - no earlier-issued active instruction is going to write it, or
  - the register containing the operand is being written by none of the currently active functional units
- When the source operands are available, the scoreboard tells the functional unit to read them and begin execution
- The scoreboard resolves RAW hazards dynamically => out-of-order execution
8 Four Stages of Scoreboard Control: 3rd Stage (EX), Execution
- Operates on operands
  - The functional unit begins execution upon receiving its operands
  - When the result is ready, the functional unit notifies the scoreboard that execution is complete
9 Four Stages of Scoreboard Control: 4th Stage (WB), Write Result
- Finish execution
  - Once the scoreboard knows the functional unit has completed execution, it checks for WAR hazards
    - If there is none, it writes the result
    - If there is a WAR hazard, it stalls the instruction
- Example
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F8,F8,F14
  - The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands
11 Three Parts of the Scoreboard
1. Instruction status: indicates which of the 4 steps (Issue, Read Operands, Execution Complete, Write Result) the instruction is in
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   Busy: indicates whether the unit is busy or not
   Op: operation to perform in the unit (e.g., + or -)
   Fi: destination register number
   Fj, Fk: source register numbers
   Qj, Qk: functional units producing source registers Fj, Fk
   Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register
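12 Scoreboard Pipeline Control
Issue
  Wait until: not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
Read operands
  Wait until: Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0
Execution complete
  Wait until: functional unit done
Write result
  Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
Notation: FU is the functional unit used by the instruction, D is its destination register, S1 and S2 are its source registers, and f ranges over all functional units.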
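The wait conditions and bookkeeping rules can be transcribed almost line for line into code. The following is a minimal sketch under stated assumptions, not the CDC 6600 design: the `Scoreboard` class and the unit names `Divide`, `Add1`, `Add2` are hypothetical, Python `None` stands in for the table's 0 or blank entries, and there is no timing model or arithmetic, only the status fields.

```python
class Scoreboard:
    """Bookkeeping only: no timing, no actual arithmetic."""

    def __init__(self, unit_names):
        # Functional unit status: one record per unit.
        self.fu = {
            name: dict(busy=False, op=None, Fi=None, Fj=None, Fk=None,
                       Qj=None, Qk=None, Rj=False, Rk=False)
            for name in unit_names
        }
        # Register result status: register -> unit that will write it.
        self.result = {}

    def can_issue(self, unit, dest):
        # Wait until: not Busy(FU) and not Result(D).
        return not self.fu[unit]["busy"] and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[unit].update(busy=True, op=op, Fi=dest, Fj=s1, Fk=s2,
                             Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
        self.result[dest] = unit

    def can_read_operands(self, unit):
        # Wait until: Rj and Rk.
        return self.fu[unit]["Rj"] and self.fu[unit]["Rk"]

    def read_operands(self, unit):
        self.fu[unit].update(Rj=False, Rk=False, Qj=None, Qk=None)

    def can_write_result(self, unit):
        # WAR check: no unit still has to read the old value of Fi(FU).
        fi = self.fu[unit]["Fi"]
        return all((f["Fj"] != fi or not f["Rj"]) and
                   (f["Fk"] != fi or not f["Rk"])
                   for f in self.fu.values())

    def write_result(self, unit):
        for f in self.fu.values():
            if f["Qj"] == unit:
                f["Rj"] = True
            if f["Qk"] == unit:
                f["Rk"] = True
        del self.result[self.fu[unit]["Fi"]]
        self.fu[unit]["busy"] = False


sb = Scoreboard(["Divide", "Add1", "Add2"])
sb.issue("Divide", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")   # Rj = No: F0 is pending
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Add2")                       # SUBD runs ahead of ADDD
stalled_before = not sb.can_write_result("Add2")

sb.read_operands("Divide")
sb.write_result("Divide")                      # sets Rj(Add1) to Yes
sb.read_operands("Add1")                       # ADDD has now read F8
stalled_after = not sb.can_write_result("Add2")
print(stalled_before, stalled_after)
```

The usage lines replay the DIVD/ADDD/SUBD sequence with two adders assumed: SUBD may not write F8 while ADDD still holds an unread, ready F8 operand, and the stall disappears once ADDD has read its operands.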
13 Scoreboard Example
    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2
14-37 Scoreboard Example, Cycles 1 to 62
[Per-cycle scoreboard snapshots (instruction status, functional unit status, register result status) for cycles 1 to 9, 11 to 22, 61 and 62. The table contents did not survive in this transcript.]
38 Scoreboard Summary
- Speedup of 1.7 for FORTRAN programs (compiler-generated code) and 2.5 for hand-coded assembly language programs, BUT slow memory (no cache) limits the benefit
- Limitations of the 6600 scoreboard
  - No forwarding hardware
  - Limited to instructions in a basic block (small window)
  - Small number of functional units (structural hazards)
  - Waits for WAR hazards
  - Prevents WAW hazards
40 Case Study: Tomasulo Algorithm
41 Limitations of Scoreboard
- No forwarding
- Limited to instructions in a basic block (small window)
- Number of functional units (structural hazards)
- Waits for WAR hazards
- Prevents WAW hazards
42 Another Dynamic Algorithm: Tomasulo Algorithm
- For the IBM 360/91, about 3 years after the CDC 6600
- Goal: high performance without special compilers
- Differences between the IBM 360 and CDC 6600 ISAs
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  - IBM has 4 FP registers vs. 8 in the CDC 6600
- Differences between the Tomasulo algorithm and the scoreboard
  - Control and buffers are distributed with the functional units, as reservation stations, vs. centralized in the scoreboard
  - Registers in instructions are replaced by pointers to reservation station buffers
  - HW renaming of registers to avoid WAR, WAW hazards
  - The Common Data Bus (CDB) broadcasts results to all FUs
  - Loads and stores are treated as FUs as well
43 Register Renaming
44 Tomasulo Organization
[Figure: Tomasulo organization, with the reservation stations and the Common Data Bus (CDB) labeled]
45 Reservation Station Components
Op: operation to perform in the unit (e.g., + or -)
Qj, Qk: reservation stations producing the source operands Vj, Vk. 0 indicates that Vj, Vk are ready, which eliminates the Rj, Rk fields of the scoreboard
Vj, Vk: values of the source operands
Busy: indicates that the reservation station and its FU are busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
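As a data structure, a reservation station entry might look like the sketch below. The class name, the field spellings, the operand values and the use of `None` in place of 0 or blank are assumptions made for illustration; only the field meanings come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of the first source operand
    vk: Optional[float] = None   # value of the second source operand
    qj: Optional[str] = None     # station producing Vj; None plays the role of 0
    qk: Optional[str] = None     # station producing Vk; None plays the role of 0

    def ready(self) -> bool:
        """Both operands are present, so no separate Rj/Rk flags are needed."""
        return self.busy and self.qj is None and self.qk is None


# Register result status: register -> station that will write it.
# A missing key corresponds to a blank entry.
register_status = {"F8": "Add1"}

# SUBD F8,F6,F2 waiting for two loads (illustrative state).
add1 = ReservationStation("Add1", busy=True, op="SUBD", qj="LD1", qk="LD2")
waiting = not add1.ready()

# Both loads deliver their values (made-up numbers).
add1.vj, add1.qj = 1.5, None
add1.vk, add1.qk = 0.5, None
print(waiting, add1.ready())
```

Because a cleared Q field already means "value present", readiness is a test on Qj and Qk alone.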
46 Three Stages of Tomasulo Algorithm
- 1. Issue: get an instruction from the FP Op Queue
  - FP op: if a reservation station is free, issue the instruction and send the operation and its operands, if they are in registers (this renames the registers).
  - LD/ST: if a buffer is available, issue the instruction.
  - If no reservation station or buffer is available, there is a structural hazard: stall.
  - Register renaming
- 2. Execution: operate on operands (EX)
  - When an operand is ready, put it in the reservation station.
  - If it is not ready, watch the CDB for the register's result.
  - When both operands are available, execute.
  - RAW check
- 3. Write Result: finish execution (WB)
  - When the result is available, write it on the Common Data Bus, and from there to all awaiting units (registers, reservation stations).
  - Mark the reservation station available.
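The three stages can be sketched as functions over two small tables. This is a simplified illustration, not the IBM 360/91 implementation: the function names, the dictionary layout, the register values and the instruction sequence are assumptions, execution latency is not modeled, and result values are passed in by hand.

```python
stations = {}     # station name -> entry; a missing name means the station is free
reg_status = {}   # register -> station that will produce it (the renaming table)
regs = {"F2": 2.0, "F4": 4.0, "F6": 6.0}   # made-up register contents


def issue(station, op, dest, src1, src2):
    """Stage 1: allocate a station, copy ready operands, rename the destination."""
    if station in stations:
        return False                       # structural hazard: stall
    entry = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for v, q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        if src in reg_status:
            entry[q] = reg_status[src]     # operand pending: remember its producer
        else:
            entry[v] = regs[src]           # operand ready: copy the value now
    stations[station] = entry
    reg_status[dest] = station
    return True


def ready(station):
    """Stage 2 condition: both operands are present (RAW resolved)."""
    entry = stations[station]
    return entry["Qj"] is None and entry["Qk"] is None


def write_result(station, value):
    """Stage 3: broadcast (station, value) on the CDB and free the station."""
    for entry in stations.values():
        for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
            if entry[q] == station:
                entry[v], entry[q] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == station:
            regs[reg] = value
            del reg_status[reg]
    del stations[station]


issue("Mult1", "MULTD", "F0", "F2", "F4")
issue("Add1", "ADDD", "F8", "F0", "F6")   # waits on Mult1, copies F6 = 6.0
issue("Add2", "SUBD", "F6", "F2", "F4")   # WAR on F6 with the ADDD
write_result("Add2", 2.0 - 4.0)           # SUBD finishes first and overwrites F6
write_result("Mult1", 2.0 * 4.0)          # CDB delivers F0 to Add1
print(stations["Add1"], regs["F6"])
```

The ADDD copied F6 into its station at issue, so the later SUBD can overwrite F6 without a WAR stall, which is the effect the renaming bullet describes.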
47-56 Tomasulo Example, Cycles 0 to 8
[Per-cycle snapshots of instruction status, reservation stations (LD1, LD2, Add1, Add2, Mult1, Mult2) and register result status for the LD, LD, MULTD, SUBD, DIVD, ADDD sequence. The table contents did not survive in this transcript.]
58 Dynamic Loop Unrolling by Tomasulo
- Eliminates WAW and WAR hazards by dynamic renaming of registers
- Predicting the branch as TAKEN allows instructions from several loop iterations to proceed in parallel
- With dynamic loop unrolling and register renaming, the need for many registers in loop unrolling can be avoided
59 Tomasulo Loop Example
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, 8
          BNEZ  R1, Loop
- This example shows dynamic loop unrolling through the completion of the first 2 iterations
- Multiply takes 4 clocks
- The load in the 1st iteration has a cache miss, which takes 8 cycles
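The renaming that makes this loop unroll itself can be sketched as follows. This is a rough illustration under several assumptions: the helper `rename_iterations` is hypothetical, the store station names are invented, stations are handed out in order and never freed (a real machine would stall at issue when a pool is exhausted), SUBI and BNEZ are left to the integer unit, and the branch is assumed predicted taken.

```python
# Floating-point part of the loop body: (op, destination, sources).
LOOP_BODY = [
    ("LD", "F0", []),                 # LD    F0, 0(R1)
    ("MULTD", "F4", ["F0", "F2"]),    # MULTD F4, F0, F2
    ("SD", None, ["F4"]),             # SD    0(R1), F4
]

# Station pools; the Store names are assumptions for this sketch.
POOLS = {
    "LD": ["Load1", "Load2", "Load3"],
    "MULTD": ["Mult1", "Mult2"],
    "SD": ["Store1", "Store2", "Store3"],
}


def rename_iterations(body, pools, iterations):
    """Issue `iterations` copies of the loop body and record, for each
    instruction, its station and the tags that replace its source registers."""
    next_free = {op: 0 for op in pools}   # stations are never freed here
    reg_status = {}                       # register -> producing station
    trace = []
    for it in range(1, iterations + 1):
        for op, dest, srcs in body:
            station = pools[op][next_free[op]]
            next_free[op] += 1
            # A pending register becomes a station tag; otherwise read R(reg).
            tags = [reg_status.get(s, f"R({s})") for s in srcs]
            if dest is not None:
                reg_status[dest] = station
            trace.append((it, op, station, tags))
    return trace


trace = rename_iterations(LOOP_BODY, POOLS, 2)
for row in trace:
    print(row)
```

Each iteration's F0 and F4 are replaced by a different station tag (Load1 and Mult1, then Load2 and Mult2), so the second iteration does not have to wait for the first to release those registers.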
60-81 Loop Example, Cycles 0 to 21
[Per-cycle snapshots of instruction status, load buffers (Load1 to Load3), reservation stations (Mult1, Mult2) and register result status. The table contents did not survive in this transcript; the slide annotations that did are listed here.]
- Cycle 2: LD cannot progress due to cache miss
- Cycle 3: X cannot progress due to F0
- Cycle 4: SD cannot progress due to F4
- Cycle 5: Execute BNEZ to get to the 2nd iteration
- Cycles 10 and 11: Start x 4
- Cycle 10: Execute BNEZ to get to the 3rd iteration
- Cycle 19: Execute BNEZ to get to the 4th iteration
82 Tomasulo Summary
- Prevents registers from becoming a bottleneck
- Avoids the WAR, WAW hazards of the scoreboard
- Allows loop unrolling in HW
- Not limited to basic blocks (provided branch prediction)
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- Next: more branch prediction