Lecture 9: Dynamic Scheduling of Pipeline

Title: Lecture 7 Dynamic Scheduling of Pipeline
Created Date: 3/30/2001 8:32:30 AM
Document presentation format: PowerPoint PPT presentation

1
Lecture 9: Dynamic Scheduling of Pipeline
2
Static vs Dynamic Scheduling
  • Static scheduling by the compiler
      • Code motion to fill LD delay slots and branch delay
        slots
      • Code motion to avoid data dependencies
      • In-order instruction issue
      • If an instruction is stalled, no later
        instruction can proceed
      • Multiple copies of a unit may sit idle
        (inefficient)
  • Dynamic scheduling by hardware
      • Allows out-of-order execution and out-of-order
        completion
      • Even when an instruction is stalled, later
        instructions that have no data dependence on
        the stalled instruction can proceed
      • Efficient use of functional units when there
        are multiple units

3
HW Schemes: Instruction Parallelism
  • Why schedule in HW at run time?
      • Works when dependencies are unknown at compile
        time
      • Simpler compiler
      • Code compiled for one machine runs well on another
  • Key idea: allow instructions behind a stall to
    proceed
        DIVD F0,F2,F4
        ADDD F10,F0,F8
        SUBD F8,F8,F14
  • In DLX, SUBD cannot execute, even if a separate
    adder is available, because in-order execution
    must be maintained
  • Enables out-of-order execution => out-of-order
    completion
  • The DLX ID stage checked for both structural hazards
    and data dependencies
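
The effect of letting SUBD go ahead can be put in numbers with a small timing sketch. The latencies below (40 cycles for DIVD, 2 for ADDD and SUBD) and the one-issue-per-cycle rule are assumptions chosen for illustration, and `start_cycles` is a hypothetical helper, not part of DLX. The model tracks only RAW waits; the WAR hazard on F8 between ADDD and SUBD is deliberately ignored here.

```python
# Assumed latencies in clock cycles; the slide does not give any.
LATENCY = {"DIVD": 40, "ADDD": 2, "SUBD": 2}

def start_cycles(program, in_order):
    """Return the cycle in which each instruction begins execution."""
    ready_at = {}      # register -> cycle when its newest value is available
    starts = []
    prev_start = 0
    for k, (op, dest, srcs) in enumerate(program):
        issue = k + 1                                      # one issue per cycle
        operands = max(ready_at.get(s, 0) for s in srcs)   # RAW wait
        start = max(issue, operands)
        if in_order:
            # In-order execution: cannot start before the previous one started.
            start = max(start, prev_start + 1)
        ready_at[dest] = start + LATENCY[op]
        starts.append(start)
        prev_start = start
    return starts

program = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F8", ("F8", "F14")),
]
in_order_starts = start_cycles(program, in_order=True)
out_of_order_starts = start_cycles(program, in_order=False)
print(in_order_starts)      # [1, 41, 42]
print(out_of_order_starts)  # [1, 41, 3]
```

Under these assumptions SUBD starts in cycle 3 instead of cycle 42, because it no longer waits behind ADDD, which is itself waiting for DIVD to produce F0.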

4
HW Schemes: Instruction Parallelism
  • Out-of-order execution divides the ID stage into two
      • 1. Issue - decode instructions, check for
        structural hazards
      • 2. Read operands - wait until no data hazards,
        then read operands
  • Scoreboards (Control Data Corp. CDC 6600) allow an
    instruction to execute whenever 1 & 2 hold, without
    waiting for prior instructions
  • Centralized implementation of hazard detection
    and resolution
      • Every instruction goes through the scoreboard
      • Scoreboard determines when an instruction can read
        operands and begin execution
      • Scoreboard monitors every change in the hardware
        to determine when to execute each instruction

5
Scoreboard Implications
  • Out-of-order completion => WAR, WAW hazards?
        WAR                 WAW
        ADDD R1,R2,R3       ADDD R1,R2,R3
        LD   R2,X           LD   R1,X
  • Solutions for WAR
      • Queue both the operation and copies of its
        operands
      • Read registers only during the Read Operands stage
  • For WAW: stall until the other instruction completes
  • Need multiple instructions in the execution
    phase => multiple execution units or pipelined
    execution units (superpipeline)
  • Scoreboard keeps track of dependencies and the
    state of operations
  • Scoreboard replaces ID, EX, WB with 4 stages
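
The three hazard types can be told apart mechanically from the destination and source registers of two instructions. The sketch below is a minimal illustration; `Instr` and `hazards` are hypothetical names, and the memory operand X of the LD instructions is assumed not to involve any register.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str           # register written
    srcs: frozenset     # registers read

def hazards(earlier, later):
    """Hazard types that matter if `later` may overtake `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")    # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")    # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")    # both write the same register
    return found

addd = Instr("ADDD", "R1", frozenset({"R2", "R3"}))
ld_r2 = Instr("LD", "R2", frozenset())   # LD R2,X
ld_r1 = Instr("LD", "R1", frozenset())   # LD R1,X

print(hazards(addd, ld_r2))   # {'WAR'}
print(hazards(addd, ld_r1))   # {'WAW'}
```

With in-order completion neither case is a problem; both appear only once the LD is allowed to finish before the ADDD.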

6
4 Stages of Scoreboard Control: 1st Stage (ID1) - Issue
  • Decode instructions and check for structural
    hazards
  • If the functional unit for the instruction is free
    (no structural hazard) and no other active
    instruction has the same destination
    register (no WAW hazard)
      • Scoreboard issues the instruction to the
        functional unit
      • Updates its internal data structure
  • If a structural hazard or WAW hazard exists
      • Stall instruction issue
      • No further instructions issue until the hazards
        are cleared
      • IF/ID1 buffer allows further instruction
        fetch (IF)

7
4 Stages of Scoreboard Control: 2nd Stage (ID2) - Read Operands
  • Wait until no data hazard, then read operands
  • To prevent RAW, a source operand is available for
    reading
      • if no earlier-issued active instruction is going
        to write it, or
      • if the register containing the operand is not
        being written by any currently active
        functional unit
  • When the source operands are available, the
    scoreboard tells the functional unit to read
    them and begin execution
  • Scoreboard resolves RAW hazards dynamically
    => out-of-order execution

8
4 Stages of Scoreboard Control: 3rd Stage (EX) - Execution
  • Operates on Operands
  • Functional Unit begins execution upon
    receiving operands
  • When the result is ready, the functional unit
    notifies the Scoreboard of the completion of
    execution

9
4 Stages of Scoreboard Control: 4th Stage (WB) - Write Result
  • Finish execution
  • When the scoreboard knows the functional unit has
    completed execution, it checks for WAR hazards
      • If there is none, it writes the result
      • If there is a WAR hazard, it stalls the instruction
  • Example
        DIVD F0,F2,F4
        ADDD F10,F0,F8
        SUBD F8,F8,F14
  • CDC 6600 scoreboard would stall SUBD until ADDD
    reads operands

10
(No Transcript)
11
3 Parts of the Scoreboard
  • 1. Instruction status - indicates which of the 4
    steps (Issue, Read Operands, Execution Complete,
    Write Result) the instruction is in
  • 2. Functional unit status - indicates the state of
    the functional unit (FU); 9 fields for each
    functional unit
      • Busy: indicates whether the unit is busy or not
      • Op: operation to perform in the unit (e.g., + or -)
      • Fi: destination register number
      • Fj, Fk: source register numbers
      • Qj, Qk: functional units producing source
        registers Fj, Fk
      • Rj, Rk: flags indicating when Fj, Fk are ready
  • 3. Register result status - indicates which
    functional unit will write each register, if one
    exists; blank when no pending instruction will
    write that register
12
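Scoreboard Pipeline Control
  • Issue
      • Wait until: not Busy(FU) and not Result(D)
      • Bookkeeping: Busy(FU) ← yes; Op(FU) ← op;
        Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2;
        Qj ← Result(S1); Qk ← Result(S2);
        Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
  • Read operands
      • Wait until: Rj and Rk
      • Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0
  • Execution complete
      • Wait until: functional unit done
  • Write result
      • Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and
        (Fk(f) ≠ Fi(FU) or Rk(f) = No))
      • Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes);
        ∀f (if Qk(f) = FU then Rk(f) ← Yes);
        Result(Fi(FU)) ← 0; Busy(FU) ← No
  • FU: the functional unit used by the instruction;
    D: destination register; S1, S2: source registers;
    f: ranges over all functional units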
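
The wait conditions and bookkeeping for Issue, Read operands, and Write result can be sketched in a few lines of Python. This is a minimal model under stated assumptions: the `Scoreboard` class and its method names are hypothetical, `None` stands in for the slides' 0 or blank, execution latency is not modeled, and a second adder (`Add2`) is assumed so that SUBD has a free unit.

```python
class Scoreboard:
    # The nine functional unit status fields, in their idle state.
    IDLE = dict(Busy=False, Op=None, Fi=None, Fj=None, Fk=None,
                Qj=None, Qk=None, Rj=False, Rk=False)

    def __init__(self, fu_names):
        self.fu = {name: dict(self.IDLE) for name in fu_names}
        self.result = {}   # register result status: register -> writing FU

    def can_issue(self, fu, dest):
        # No structural hazard (FU free) and no WAW hazard (dest not pending).
        return not self.fu[fu]["Busy"] and dest not in self.result

    def issue(self, fu, op, dest, s1, s2):
        assert self.can_issue(fu, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu].update(Busy=True, Op=op, Fi=dest, Fj=s1, Fk=s2,
                           Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
        self.result[dest] = fu

    def can_read(self, fu):
        return self.fu[fu]["Rj"] and self.fu[fu]["Rk"]    # no RAW hazard left

    def read_operands(self, fu):
        assert self.can_read(fu)
        self.fu[fu].update(Rj=False, Rk=False, Qj=None, Qk=None)

    def can_write(self, fu):
        # WAR check: no unit still has to read the old value of our destination.
        fi = self.fu[fu]["Fi"]
        return all((u["Fj"] != fi or not u["Rj"]) and
                   (u["Fk"] != fi or not u["Rk"])
                   for u in self.fu.values())

    def write_result(self, fu):
        assert self.can_write(fu)
        for u in self.fu.values():          # wake up units waiting on this FU
            if u["Qj"] == fu:
                u["Rj"] = True
            if u["Qk"] == fu:
                u["Rk"] = True
        del self.result[self.fu[fu]["Fi"]]
        self.fu[fu] = dict(self.IDLE)

sb = Scoreboard(["Div", "Add1", "Add2"])
sb.issue("Div", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Div")
sb.read_operands("Add2")
addd_waits_raw = not sb.can_read("Add1")    # True: F0 not yet written
subd_waits_war = not sb.can_write("Add2")   # True: ADDD has not read F8
sb.write_result("Div")
sb.read_operands("Add1")
subd_may_write = sb.can_write("Add2")       # True
```

The trace reproduces the DIVD/ADDD/SUBD behavior described for the CDC 6600: SUBD reads its operands early but cannot write F8 until ADDD has read its operands.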
13
Scoreboard Example
        LD    F6, 34(R2)
        LD    F2, 45(R3)
        MULTD F0, F2, F4
        SUBD  F8, F6, F2
        DIVD  F10, F0, F6
        ADDD  F6, F8, F2
14-37
Scoreboard Example: Cycles 1-7, 8a, 8b, 9, 11-22, 61, 62 (status tables, No Transcript)
38
Scoreboard Summary
  • Speedup of 1.7 for FORTRAN programs and 2.5 for
    hand-coded assembly language programs, BUT slow
    memory (no cache) limits the benefit
  • Limitations of the 6600 scoreboard
      • No forwarding hardware
      • Limited to instructions in a basic block (small
        window)
      • Small number of functional units (structural
        hazards)
      • Waits for WAR hazards
      • Prevents WAW hazards

39
(No Transcript)
40
Case Study: Tomasulo Algorithm
41
Limitations of Scoreboard
  • No forwarding
  • Limited to instructions in basic block (small
    window)
  • Number of functional units (structural hazards)
  • Wait for WAR hazards
  • Prevent WAW hazards

42
Another Dynamic Algorithm: Tomasulo Algorithm
  • For IBM 360/91, about 3 years after CDC 6600
  • Goal: high performance without special compilers
  • Differences between IBM 360 & CDC 6600 ISA
      • IBM has only 2 register specifiers/instr vs. 3 in
        CDC 6600
      • IBM has 4 FP registers vs. 8 in CDC 6600
  • Differences between Tomasulo Algorithm &
    Scoreboard
      • Control & buffers are distributed with the Function
        Units, called reservation stations, vs.
        centralized in the scoreboard
      • Registers in instructions are replaced by
        pointers to reservation station buffers
      • HW renaming of registers to avoid WAR, WAW
        hazards
      • Common Data Bus (CDB) broadcasts results to all
        FUs
      • Loads and stores treated as FUs as well
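
Why renaming removes WAR and WAW hazards can be seen by rewriting a short instruction sequence so that every destination gets a fresh name. The sketch below is illustrative only: the tags T1, T2, ... are hypothetical (in Tomasulo's scheme the reservation station names play this role), and `rename` is not a hardware description.

```python
from itertools import count

def rename(program):
    """Give each destination a fresh tag; sources use the latest tag."""
    latest = {}          # architectural register -> tag of its newest producer
    fresh = count(1)
    renamed = []
    for op, dest, srcs in program:
        new_srcs = tuple(latest.get(s, s) for s in srcs)   # keep true (RAW) deps
        tag = f"T{next(fresh)}"
        latest[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed

program = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F8", ("F8", "F14")),
]
renamed = rename(program)
for instr in renamed:
    print(instr)
# ('DIVD', 'T1', ('F2', 'F4'))
# ('ADDD', 'T2', ('T1', 'F8'))
# ('SUBD', 'T3', ('F8', 'F14'))
```

After renaming, SUBD writes T3 rather than F8, so it can complete before ADDD reads F8 without a WAR hazard; the only remaining dependence is the true RAW dependence of ADDD on T1.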

43
Register Renaming
44
Tomasulo Organization
Reservation Station
Common Data Bus(CDB)
45
Reservation Station Components
  • Op: operation to perform in the unit (e.g., + or -)
  • Qj, Qk: reservation stations producing source
    operands Vj, Vk; 0 indicates that Vj, Vk are
    ready, eliminating the Rj, Rk fields of the
    scoreboard
  • Vj, Vk: values of the source operands
  • Busy: indicates that the reservation station and
    its FU are busy
  • Register result status: indicates which
    functional unit will write each register, if one
    exists; blank when no pending instruction will
    write that register
46
Three Stages of Tomasulo Algorithm
  • 1. Issue: get instruction from FP Op Queue
      • FP op: if a reservation station is free, issue
        the instr and send the operation and operands,
        if they are in the registers (renames registers)
      • LD/ST: if a buffer is available, issue the instr
      • If no reservation station or buffer is
        available: structural hazard, stall
      • Register renaming
  • 2. Execution: operate on operands (EX)
      • When an operand is ready, put it in the
        reservation station
      • If not ready, watch the CDB for the result
      • When both operands are available, execute
      • RAW check
  • 3. Write Result: finish execution (WB)
      • When the result is available, write it on the
        Common Data Bus, and from there to all awaiting
        units (registers, reservation stations)
      • Mark the reservation station available
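
The three stages can be sketched as a small Python model. Everything named here (`Tomasulo`, `OPS`, the station names, the initial register values) is a hypothetical illustration rather than the IBM 360/91 design; latencies, load/store buffers, and CDB arbitration are left out, and `None` plays the role of 0 in the Qj, Qk fields.

```python
import operator

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}

class Tomasulo:
    def __init__(self, stations, regs):
        self.rs = {name: None for name in stations}   # None = station free
        self.regs = dict(regs)                        # register values
        self.qi = {}          # register result status: register -> station tag

    def issue(self, station, op, dest, s1, s2):
        if self.rs[station] is not None:
            return False                              # structural hazard: stall
        e = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        for v, q, src in (("Vj", "Qj", s1), ("Vk", "Qk", s2)):
            if src in self.qi:
                e[q] = self.qi[src]                   # wait for the producer's tag
            else:
                e[v] = self.regs[src]                 # copy the value now
        self.rs[station] = e
        self.qi[dest] = station                       # rename the destination
        return True

    def ready(self, station):
        e = self.rs[station]
        return e is not None and e["Qj"] is None and e["Qk"] is None

    def execute(self, station):
        e = self.rs[station]
        return OPS[e["op"]](e["Vj"], e["Vk"])

    def write_result(self, station, value):
        # Broadcast (tag, value) on the CDB to stations and registers.
        for e in self.rs.values():
            if e is None:
                continue
            if e["Qj"] == station:
                e["Vj"], e["Qj"] = value, None
            if e["Qk"] == station:
                e["Vk"], e["Qk"] = value, None
        for reg, tag in list(self.qi.items()):
            if tag == station:
                self.regs[reg] = value
                del self.qi[reg]
        self.rs[station] = None                       # station available again

t = Tomasulo(["Div1", "Add1", "Add2"],
             {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F8": 5.0, "F10": 0.0, "F14": 1.0})
t.issue("Div1", "DIVD", "F0", "F2", "F4")
t.issue("Add1", "ADDD", "F10", "F0", "F8")    # waits on Div1, copies old F8
t.issue("Add2", "SUBD", "F8", "F8", "F14")
subd_goes_first = t.ready("Add2") and not t.ready("Add1")
t.write_result("Add2", t.execute("Add2"))     # F8 becomes 4.0
t.write_result("Div1", t.execute("Div1"))     # F0 becomes 4.0, wakes Add1
t.write_result("Add1", t.execute("Add1"))     # 4.0 + old F8 (5.0)
print(t.regs["F8"], t.regs["F10"])            # 4.0 9.0
```

Because ADDD copied the old value of F8 into its reservation station at issue, SUBD can write F8 first and ADDD still computes with 5.0, which is the WAR case the scoreboard had to stall on.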

47-56
Tomasulo Example: Cycles 0-3, 4a, 4b, 5-8 (status tables, No Transcript)
57
(No Transcript)
58
Dynamic Loop Unrolling by Tomasulo
  • WAW and WAR hazards are eliminated by dynamic
    renaming of registers
  • Predicting the branch as TAKEN allows multiple
    instructions in the loop to proceed in parallel
  • With dynamic loop unrolling and register
    renaming, the need for the many registers that
    loop unrolling usually requires can be avoided

59
Tomasulo Loop Example
        Loop: LD    F0, 0(R1)
              MULTD F4, F0, F2
              SD    0(R1), F4
              SUBI  R1, R1, 8
              BNEZ  R1, Loop
  • This example shows dynamic loop unrolling; it
    shows the completion of the first 2 iterations
  • Multiply takes 4 clocks
  • The load in the 1st iteration has a cache miss,
    which takes 8 cycles
60-81
Loop Example: Cycle 0 to Cycle 21 (status tables, No Transcript)
  • Cycle 2: LD cannot progress due to cache miss
  • Cycle 3: MULTD cannot progress due to F0
  • Cycle 4: SD cannot progress due to F4
  • Cycle 5: Execute BNEZ to get to the 2nd iteration
  • Cycle 10: Execute BNEZ to get to the 3rd iteration
  • Cycles 10 and 11: Start x 4
  • Cycle 19: Execute BNEZ to get to the 4th iteration
82
Tomasulo Summary
  • Prevents registers from becoming a bottleneck
  • Avoids the WAR, WAW hazards of the scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks (provided branch
    prediction)
  • Lasting contributions
      • Dynamic scheduling
      • Register renaming
      • Load/store disambiguation
  • Next: more branch prediction