COMP 206: Computer Architecture and Implementation

About This Presentation
Title: COMP 206: Computer Architecture and Implementation

Description: Montek Singh, Mon, Oct 10, 2005. Topic: Instruction-Level Parallelism (Dynamic Scheduling: Tomasulo's Algorithm)

Slides: 50
Learn more at: http://www.cs.unc.edu

Transcript and Presenter's Notes


1
COMP 206: Computer Architecture and Implementation
  • Montek Singh
  • Mon, Oct 10, 2005
  • Topic: Instruction-Level Parallelism
  • (Dynamic Scheduling: Tomasulo's Algorithm)

2
Reading
  • Chapter 3: ILP and Its Dynamic Exploitation
  • Sections 3.1-3.3

3
Dynamic Scheduling: Tomasulo's Algorithm
  • For IBM 360/91 (about three years after CDC 6600)
  • Goal: High performance without special compilers
  • Differences between IBM 360 and CDC 6600 ISA:
  • IBM has only 2 register specifiers per instruction
    versus 3 in CDC 6600
  • IBM has 4 FP registers versus 8 in CDC 6600
  • Differences between Tomasulo's Algorithm and
    Scoreboard:
  • Control and buffers (called reservation stations)
    are distributed with the functional units rather
    than centralized as in the scoreboard
  • Register specifiers in instructions are replaced
    by pointers to reservation station buffers
  • Hardware renaming of registers avoids WAR and
    WAW hazards
  • Common Data Bus (CDB) broadcasts results to all
    FUs (forwarding)
  • Loads and stores are treated as FUs as well
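
To illustrate why hardware renaming removes WAR and WAW hazards, here is a minimal sketch. It assumes a simplified instruction encoding of (destination, source, source) and an unlimited supply of fresh tags; the function name `rename` and the tag names `T0`, `T1`, ... are hypothetical and are not taken from the 360/91, which reuses a small pool of reservation station tags.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh tag; sources read the newest tag.

    `program` is a list of (dest, src1, src2) register names. This is a
    simplified model, not the 360/91 hardware.
    """
    latest = {}        # architectural register -> tag of its newest producer
    fresh = count()    # hypothetical unlimited supply of tag numbers
    renamed = []
    for dest, *srcs in program:
        # Sources are looked up BEFORE the destination is remapped.
        new_srcs = [latest.get(s, s) for s in srcs]
        tag = f"T{next(fresh)}"
        latest[dest] = tag
        renamed.append((tag, *new_srcs))
    return renamed

prog = [("F0", "F2", "F4"),
        ("F2", "F6", "F8"),   # WAR on F2 with the first instruction
        ("F0", "F2", "F6")]   # WAW on F0 with the first instruction
renamed_prog = rename(prog)
```

After renaming, every instruction writes a distinct tag, so only the true (RAW) dependence of the third instruction on the second remains.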

4
Tomasulo Organization

5
More Details of Tomasulo Organization
  • Entities that produce values are assigned 4-bit
    tags:
  • 1, 2, 3, 4, 5, 6 for load buffers
  • 8, 9 for multiplier reservation stations
  • 10, 11, 12 for adder reservation stations
  • Tag 0 indicates presence of valid data
  • FP registers have busy bits:
  • 0 means that the register holds valid data
  • 1 means that it is waiting to receive a value
    from the source identified by its tag field
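
The tag assignment and busy bits described above can be sketched as a small data structure. The tag numbers come from the slide; the names `TAG_UNITS`, `FPRegister`, and `producer_of` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass

# Tag assignment from the slide; tag 0 means "value is present".
TAG_UNITS = {
    **{t: "load buffer" for t in range(1, 7)},
    **{t: "multiplier RS" for t in (8, 9)},
    **{t: "adder RS" for t in (10, 11, 12)},
}

@dataclass
class FPRegister:
    value: float = 0.0
    busy: int = 0   # 0: holds valid data, 1: waiting on `tag`
    tag: int = 0    # producer to wait for when busy == 1

def producer_of(reg):
    """Describe where the register's value will come from."""
    if reg.busy == 0:
        return "register (valid)"
    return TAG_UNITS[reg.tag]

f0 = FPRegister(busy=1, tag=4)   # waiting for load buffer 4
f1 = FPRegister(value=2.5)       # holds valid data
```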

6
Tomasulo: Representing Data Dependences
  • Inputs:
  • Operand is a register with busy bit 0: data is
    copied immediately (through the register bus)
    into the reservation station (RS), and the tag
    field of the RS is set to 0
  • Operand is a register with busy bit 1: the tag
    field of the RS receives a copy of the register's
    tag field
  • Operand is a load buffer that contains valid
    data: data is copied into the RS
  • Operand is a load buffer that is awaiting data:
    the tag field of the RS receives the tag of the
    load buffer
  • Outputs:
  • Output is a register: busy bit is set to 1, tag
    is set to the RS tag
  • Output is a store buffer: tag is set to the RS
    tag, destination address is set
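
The input and output rules for register operands can be sketched as two small functions. This is an illustration under simplifying assumptions (load and store buffers are omitted); the names `Reg`, `Operand`, `capture`, and `claim` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Reg:
    value: float = 0.0
    busy: int = 0
    tag: int = 0

@dataclass
class Operand:       # one source slot in a reservation station
    tag: int = 0     # 0 means `value` is valid
    value: float = 0.0

def capture(reg):
    """Input rule: copy the value if the register is valid, else its tag."""
    if reg.busy == 0:
        return Operand(tag=0, value=reg.value)
    return Operand(tag=reg.tag)

def claim(reg, rs_tag):
    """Output rule: the register now awaits the station `rs_tag`."""
    reg.busy, reg.tag = 1, rs_tag

# F0 is waiting on load buffer 4; F1 holds valid data.
regs = {"F0": Reg(busy=1, tag=4), "F1": Reg(value=3.0)}
left, right = capture(regs["F0"]), capture(regs["F1"])
claim(regs["F0"], 10)   # the adder station with tag 10 will produce F0
```

Sources are captured before the destination is claimed, so an instruction that reads and writes F0 still sees the old producer (tag 4) for its input.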

7
Three Stages of Tomasulo's Algorithm
  • Issue: get instruction from FP operation queue
  • If a reservation station is free, the control
    logic issues the instruction and sends operands
    (renames registers)
  • Execution: operate on operands (EX)
  • When both operands are ready, execute; if not
    ready, watch the CDB for the result
  • Write Result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting
    units; mark reservation station available
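
The interplay of the three stages can be sketched with a toy reservation station that watches the CDB. This is a minimal model, not a cycle-accurate simulator; the class `Station` and the function `write_result` are hypothetical names, and operands are stored as [tag, value] pairs where tag 0 means the value is ready.

```python
class Station:
    def __init__(self, tag):
        self.tag, self.busy = tag, False
        self.src = []   # list of [tag, value]; tag 0 means ready

    def issue(self, operands):
        """Issue stage: occupy the station and record its operands."""
        self.busy, self.src = True, [list(o) for o in operands]

    def ready(self):
        """Execution may start once every operand tag is 0."""
        return self.busy and all(t == 0 for t, _ in self.src)

    def snoop(self, tag, value):
        """Watch the CDB: fill any operand waiting on `tag`."""
        for o in self.src:
            if o[0] == tag:
                o[0], o[1] = 0, value

def write_result(producer, value, stations):
    """Write Result stage: broadcast on the CDB, then free the producer."""
    for s in stations:
        s.snoop(producer.tag, value)
    producer.busy = False

add1, add2 = Station(10), Station(11)
add1.issue([(0, 1.0), (0, 2.0)])      # both operands available
add2.issue([(10, None), (0, 5.0)])    # left operand waits on tag 10
before = add2.ready()
write_result(add1, 1.0 + 2.0, [add1, add2])
```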

8
Tomasulo: State Transitions
[Figure: state-transition diagrams for Load Buffer and Register]
  • In case of a CDB conflict, the earlier instruction
    has priority
  • If more than one instruction is enabled in the
    reservation stations of the adder or multiplier
    in the same cycle, the top entry has priority
  • If a CDB transfer and an issue occur in the same
    cycle, the CDB transfer is assumed to occur first
  • Every instruction should spend at least one cycle
    in the R stage
  • If an instruction being issued both reads and
    writes the same register, and the source operand
    is actually in the register (busy bit 0), then
    the register is read first, and then its busy bit
    is set to 1, making it unreadable
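
The two priority rules above can be sketched as follows. The function names `cdb_winner` and `select_station` and the request encoding are hypothetical; instruction numbers stand in for program order.

```python
def cdb_winner(requests):
    """Among (instruction_number, tag) requests, the earliest instruction wins."""
    return min(requests)[1] if requests else None

def select_station(stations):
    """Return the tag of the first enabled station in top-to-bottom order."""
    for tag, enabled in stations:
        if enabled:
            return tag
    return None

# Instructions 101 (tag 10) and 103 (tag 12) both want the CDB.
winner = cdb_winner([(103, 12), (101, 10)])
# Adder stations listed top to bottom; 11 and 12 are both enabled.
chosen = select_station([(10, False), (11, True), (12, True)])
```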

9
Tomasulo Example
100  F0 ← A
101  F0 ← F0 + F1
102  F0 ← F0 + B
103  F2 ← F2 + F3
104  F1 ← F1 × F2
105  C  ← F1
106  F0 ← F1 / F0
I: Issued   R: In reservation station   X: In execution   W: Writing result through CDB

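
As a cross-check on the cycle-by-cycle slides that follow, the true (RAW) dependences of the example program can be derived mechanically. The sketch below records only the destination and sources of each instruction, so it does not depend on which arithmetic operation is performed; the function name `raw_dependences` is hypothetical.

```python
# (instruction number, destination, sources); A, B, C are memory operands.
program = [
    (100, "F0", ["A"]),
    (101, "F0", ["F0", "F1"]),
    (102, "F0", ["F0", "B"]),
    (103, "F2", ["F2", "F3"]),
    (104, "F1", ["F1", "F2"]),
    (105, "C",  ["F1"]),
    (106, "F0", ["F1", "F0"]),
]

def raw_dependences(program):
    """Map each instruction to the earlier instructions that produce its sources."""
    last_writer = {}
    deps = {}
    for num, dest, srcs in program:
        producers = [last_writer[s] for s in srcs if s in last_writer]
        if producers:
            deps[num] = producers
        last_writer[dest] = num
    return deps

deps = raw_dependences(program)
```

For instance, instruction 106 depends on 104 (for F1) and 102 (for F0), which corresponds to the divider waiting on tags 8 and 11 in Cycle 7.
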
10
Tomasulo Example Cycle 0
  • System is quiescent

11
Tomasulo Example Cycle 1
  • (A) will arrive at tag 4
  • (F0) will come from tag 4
  • F0 is set to busy

12
Tomasulo Example Cycle 2
  • (F0) will be produced at tag 10
  • Right input of adder came from register (tag bit
    0)
  • Left input of adder will come from tag 4
  • Forwarding tag of F0 has been changed from 4 to 10

13
Tomasulo Example Cycle 3
  • (F0) will be produced at tag 11
  • (B) will arrive at tag 3
  • Right input of adder will come from tag 3
  • Left input of adder will come from tag 10
  • (A) arrives from memory
  • Forwarding tag of F0 has been changed from 10 to
    11

14
Tomasulo Example Cycle 4
  • (F2) will be produced at tag 12
  • Right input of adder came from register (tag bit
    0)
  • Left input of adder came from register (tag bit
    0)
  • (A) with tag 4 is broadcast on CDB
  • Adder (at tag 10) picks it up, and is thereby
    enabled
  • The instruction that will write F2 has already
    read the old contents of F2

15
Tomasulo Example Cycle 5
  • (F1) will be produced at tag 8
  • Right input of multiplier will come from tag 12
  • Left input of multiplier came from register (tag
    bit 0)
  • Adder (at tag 10) starts computing
  • (B) arrives from memory

16
Tomasulo Example Cycle 6
  • Memory address of destination is C
  • Data will come from tag 8
  • Adder (at tag 10) finishes computing
  • (B) with tag 3 is broadcast on CDB
  • Adder (at tag 11) picks it up
  • Adder (at tag 12) starts computing

17
Tomasulo Example Cycle 7
  • (F1) will be produced at tag 9
  • Right input of divider will come from tag 11
  • Left input of divider will come from tag 8
  • Result of adder (with tag 10) is broadcast on CDB
  • Adder (at tag 11) picks it up and is thereby
    enabled
  • Adder (at tag 12) finishes computing

18
Tomasulo Example Cycle 8
  • Result of adder (at tag 12) is broadcast on CDB
  • Multiplier (at tag 8) picks it up and is thereby
    enabled

19
Tomasulo Example Cycle 9
  • Multiplier (at tag 8) starts computing
  • Adder (at tag 11) finishes computing

20
Tomasulo Example Cycle 10
  • Result of adder (with tag 11) is broadcast on CDB
  • Divider (at tag 9) picks it up
  • Register F0 picks it up

21
Tomasulo Example Cycle 11
  • Multiplier (at tag 8) finishes computing

22
Tomasulo Example Cycle 12
  • Result of multiplier (at tag 8) is broadcast on
    CDB
  • Divider (at tag 9) picks it up, and is thereby
    enabled
  • Store buffer (at tag 1) picks it up, and is
    thereby enabled

23
Observations on Tomasulo's Algorithm
  • Instructions move from decoder to reservation
    stations in program order, so dependences can be
    correctly recorded
  • Data Flow Graph: the graph of pointers
    connecting the RS, registers, and memory buffers
    helps accomplish out-of-order sequencing of
    instructions
  • Chief cost of this scheme: high-speed
    associative hardware
  • RS hardware has to search for tags when the CDB
    broadcasts a value with its tag
  • Full load bypassing is supported:
  • Load and store buffers are treated just like
    functional units
  • Additional hardware on the 360/91 also supported
    load forwarding
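
A rough sense of the associative-hardware cost can be had by counting tag comparators, one per tag field that must watch the CDB. The sketch below assumes two tag fields per reservation station, one per FP register, and one per store buffer; the station and register counts follow the earlier slides, while the store buffer count of 3 is an assumption not stated in this presentation. The function name `tag_comparators` is hypothetical.

```python
def tag_comparators(adder_rs=3, mult_rs=2, fp_regs=4, store_buffers=3):
    """Count tag fields that must be compared against each CDB broadcast.

    store_buffers=3 is an assumed value; the other defaults come from
    the tag assignment (adders 10-12, multipliers 8-9) and the 4 FP
    registers mentioned earlier in the slides.
    """
    rs_fields = 2 * (adder_rs + mult_rs)   # left and right operand tags
    return rs_fields + fp_regs + store_buffers

total = tag_comparators()
```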

24
Tomasulo Example of Load Bypassing
  • Instruction 202 depends on instructions 200 and
    201, so instruction 203 will start executing much
    before 202 (assuming C and D are found to be
    different memory addresses)
  • Work out details off-line

200  F0 ← A
201  F0 ← F0 / F1
202  C  ← F0
203  F0 ← D
204  F0 ← F0 + F2

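
The address check behind load bypassing can be sketched as below. This is a simplified illustration: the function name `resolve_load`, the buffer encoding, and the numeric addresses chosen for C and D are all hypothetical. A pending store whose data has not yet arrived is represented with a value of `None`.

```python
def resolve_load(addr, pending_stores, memory):
    """Decide how a load behaves relative to earlier, still-pending stores.

    pending_stores: list of (address, value) in program order, where a
    value of None means the store is still waiting for its data.
    """
    for st_addr, st_val in reversed(pending_stores):   # youngest first
        if st_addr == addr:
            if st_val is None:
                return None, "wait"          # must wait for the store's data
            return st_val, "forwarded"       # load forwarding
    return memory[addr], "bypassed"          # no conflict: load goes ahead

C, D = 0x100, 0x200                  # assumed, distinct addresses
memory = {C: 1.0, D: 7.0}
stores = [(C, None)]                 # instruction 202 is waiting for F0
load_203 = resolve_load(D, stores, memory)
```

Because D differs from C, the load at 203 proceeds without waiting for the store at 202.
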
25
Tomasulo: Loop Unrolling in Hardware
  • 360/91 supported a limited kind of speculation
  • Small loops could be held in a loop buffer
  • Loop closing branches were predicted as taken
  • This has the effect of loop unrolling at run-time
  • Given the small number of FP registers in
    machine, software loop unrolling was not a viable
    option

26
Tomasulo Loop Example
    Loop: L.D    F0, 0(R1)
          MULT.D F4, F0, F2
          S.D    F4, 0(R1)
          SUBI   R1, R1, 8
          BNEZ   R1, Loop
  • Multiply takes 4 clocks
  • Loads have cache misses
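
The run-time unrolling effect can be sketched by tracking which station will produce each register as successive iterations are issued. The sketch assumes the loop-closing branch is predicted taken and that enough buffers and stations are free; the function name `issue_iterations` and the station names (`Load1`, `Mult1`, `Store1`, ...) are hypothetical labels for this illustration.

```python
def issue_iterations(n_iters):
    """Issue the loop body n_iters times and record who waits on whom."""
    reg_tag = {}   # register -> station that will produce its value
    waits = []     # (consumer station, producer station it waits on)
    for i in range(1, n_iters + 1):
        load, mult, store = f"Load{i}", f"Mult{i}", f"Store{i}"
        reg_tag["F0"] = load                  # L.D    F0, 0(R1)
        waits.append((mult, reg_tag["F0"]))   # MULT.D F4, F0, F2
        reg_tag["F4"] = mult
        waits.append((store, reg_tag["F4"]))  # S.D    F4, 0(R1)
    return waits

waits = issue_iterations(2)
```

Each iteration's multiply waits on its own load, so the second iteration does not have to wait for the first to finish, even though both use the names F0 and F4.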

27-48
Loop Example Cycles 0-21
  • One figure per cycle, from Cycle 0 (slide 27)
    through Cycle 21 (slide 48); the figures are not
    part of this transcript

49
Summary of Tomasulo's Algorithm
  • Prevents registers from becoming a bottleneck
  • Avoids WAR and WAW hazards of scoreboard
  • Allows loop unrolling in hardware
  • Not limited to basic blocks (provided we have
    branch prediction)
  • Lasting contributions:
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • Next: Dynamic branch prediction