Title: COMP 206: Computer Architecture and Implementation
1COMP 206: Computer Architecture and Implementation
- Montek Singh
- Mon., Oct. 6, 2003
- Topic: Instruction-Level Parallelism
- (Dynamic Scheduling: Tomasulo's Algorithm)
2Reading
- Chapter 3: ILP and Its Dynamic Exploitation
- Sections 3.1-3.3
3Dynamic Scheduling: Tomasulo's Algorithm
- For IBM 360/91 (about three years after CDC 6600)
- Goal: High performance without special compilers
- Differences between IBM 360 and CDC 6600 ISA
  - IBM has only 2 register specifiers/instruction versus 3 in CDC 6600
  - IBM has 4 FP registers versus 8 in CDC 6600
- Differences between Tomasulo's Algorithm and Scoreboard
  - Control and buffers are distributed with the Function Units (as "reservation stations") versus centralized in the scoreboard
  - Registers in instructions replaced by pointers to reservation station buffer
  - Hardware renaming of registers to avoid WAR and WAW hazards
  - Common Data Bus broadcasts results to all FUs (forwarding)
  - Loads and Stores treated as FUs as well
4Tomasulo Organization
5More Details of Tomasulo Organization
- Entities that produce values are assigned 4-bit tags
  - 1, 2, 3, 4, 5, 6 for load buffers
  - 8, 9 for multiplier reservation stations
  - 10, 11, 12 for adder reservation stations
  - Tag 0 indicates presence of valid data
- FP registers have busy bits
  - 0 means that register holds valid data
  - 1 means that it is waiting to receive value from source identified by its tag field
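The tag numbering and busy bits above can be written down as a small data structure. This is only a sketch: the names `producer_kind` and `FPRegister` are hypothetical, and tags the slide does not list (7 and 13-15) are simply reported as unassigned.

```python
from dataclasses import dataclass

# Tag ranges as listed on the slide; tag 0 means "valid data is present".
VALID = 0
LOAD_BUFFER_TAGS = range(1, 7)       # 1..6
MULTIPLIER_RS_TAGS = range(8, 10)    # 8, 9
ADDER_RS_TAGS = range(10, 13)        # 10, 11, 12


def producer_kind(tag: int) -> str:
    """Name the kind of unit that owns a 4-bit tag (hypothetical helper)."""
    if not 0 <= tag <= 15:
        raise ValueError("tags are 4 bits wide")
    if tag == VALID:
        return "valid data"
    if tag in LOAD_BUFFER_TAGS:
        return "load buffer"
    if tag in MULTIPLIER_RS_TAGS:
        return "multiplier RS"
    if tag in ADDER_RS_TAGS:
        return "adder RS"
    return "unassigned"


@dataclass
class FPRegister:
    value: float = 0.0
    busy: int = 0      # 0: holds valid data; 1: waiting on the unit named by `tag`
    tag: int = VALID   # meaningful only while busy == 1
```

For example, `producer_kind(11)` reports an adder reservation station, so a register with `busy=1, tag=11` is waiting for that adder's result.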
6Tomasulo: Representing Data Dependences
- Inputs
  - Operand is a register with busy bit 0
    - Data copied immediately (through register bus) into reservation station
    - Tag field of RS set to 0
  - Operand is a register with busy bit 1
    - Tag field of RS receives a copy of the register tag field
  - Operand is a load buffer that contains valid data
    - Data copied into RS
  - Operand is a load buffer that is awaiting data
    - Tag field of RS receives tag of load buffer
- Outputs
  - Output is a register
    - Busy bit set to 1, tag set to RS tag
  - Output is a store buffer
    - Tag set to RS tag, destination address set
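The register cases above can be sketched in a few lines. This is an illustration under stated assumptions, not the 360/91 logic itself: `Reg`, `Operand`, `read_source`, and `claim_destination` are hypothetical names, and the load-buffer and store-buffer cases are left out.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    value: float = 0.0
    busy: int = 0   # 0: value is valid; 1: waiting on `tag`
    tag: int = 0


@dataclass
class Operand:
    value: float = 0.0
    tag: int = 0    # 0: value is valid; otherwise the tag of the producer


def read_source(reg: Reg) -> Operand:
    """Fill one reservation-station operand from a source register."""
    if reg.busy == 0:
        return Operand(value=reg.value, tag=0)   # copy the data, tag 0
    return Operand(tag=reg.tag)                  # copy the register's tag


def claim_destination(reg: Reg, rs_tag: int) -> None:
    """Mark a destination register as waiting on the issuing station."""
    reg.busy = 1
    reg.tag = rs_tag


# Issue "F0 <- F0 op F1" into the station with tag 10 while F0 waits on tag 4.
f0, f1 = Reg(busy=1, tag=4), Reg(value=2.5)
left, right = read_source(f0), read_source(f1)   # sources are read first
claim_destination(f0, 10)                        # then F0 is renamed to tag 10
```

Reading the sources before claiming the destination matters when an instruction reads and writes the same register, as it does here.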
7Three Stages of Tomasulo's Algorithm
- Issue: get instruction from FP operation queue
  - If a reservation station is free, the control logic issues the instruction and sends operands (renames registers)
- Execution: operate on operands (EX)
  - When both operands are ready, execute; if not ready, watch CDB for result
- Write Result: finish execution (WB)
  - Write on Common Data Bus to all awaiting units; mark reservation station available
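The Write Result stage amounts to a tag match against everything that is waiting. The sketch below assumes a simplified model in which a register is "busy" exactly when its tag is nonzero; `Slot`, `Station`, and `broadcast` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    value: float = 0.0
    tag: int = 0   # 0: value is valid; otherwise waiting on this producer tag


@dataclass
class Station:
    tag: int
    left: Slot
    right: Slot

    def ready(self) -> bool:
        """Both operands present, so execution may begin."""
        return self.left.tag == 0 and self.right.tag == 0


def broadcast(tag: int, value: float, stations, registers) -> None:
    """Deliver (tag, value) from the CDB to every waiting slot."""
    if tag == 0:
        raise ValueError("tag 0 means 'valid' and is never broadcast")
    for rs in stations:
        for slot in (rs.left, rs.right):
            if slot.tag == tag:
                slot.value, slot.tag = value, 0
    for reg in registers.values():
        if reg.tag == tag:
            reg.value, reg.tag = value, 0


# A station with tag 9 waits on tags 8 and 11; register F0 waits on tag 11.
divider = Station(tag=9, left=Slot(tag=8), right=Slot(tag=11))
regs = {"F0": Slot(tag=11)}
broadcast(11, 7.0, [divider], regs)
```

After this broadcast the station has its right operand but is still waiting on tag 8, so it is not yet enabled, while F0 now holds valid data.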
8Tomasulo: State Transitions
[Figure not reproduced; surviving labels: Load Buffer, Register]
- In case of a CDB conflict, earlier instruction has priority
- If more than one instruction is enabled in the reservation stations of adder or multiplier in same cycle, top entry has priority
- If CDB transfer and issue occur in same cycle, CDB transfer is assumed to occur first
- Every instruction should spend at least one cycle in R stage
- If an instruction being issued both reads and writes the same register, and the source operand is actually in the register (busy bit 0), then first the register is read, and then its busy bit is turned to 1, making it unreadable
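The first two priority rules can be expressed as two small selection functions. This is a sketch only: it assumes each CDB request carries the issue order of its instruction, and that a reservation-station group is listed top to bottom. Both function names are hypothetical.

```python
def cdb_winner(requests):
    """Pick the CDB request of the earliest-issued instruction.

    Each request is a tuple (issue_order, tag, value); a smaller
    issue_order means an earlier instruction. Returns None if idle.
    """
    if not requests:
        return None
    return min(requests, key=lambda request: request[0])


def select_enabled(enabled_flags):
    """Return the index of the topmost enabled entry, or None.

    `enabled_flags` lists the entries of one reservation-station
    group (adder or multiplier) from top to bottom.
    """
    for index, enabled in enumerate(enabled_flags):
        if enabled:
            return index
    return None
```

For instance, if units with tags 11 and 12 both request the CDB and the instruction in 12 was issued earlier, `cdb_winner([(5, 11, 1.0), (3, 12, 2.0)])` selects the request from tag 12.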
9Tomasulo Example
100  F0 ← A
101  F0 ← F0 + F1
102  F0 ← F0 + B
103  F2 ← F2 + F3
104  F1 ← F1 × F2
105  C ← F1
106  F0 ← F1 / F0
Legend: I = Issued, R = In reservation station, X = In execution, W = Writing result through CDB
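Before stepping through the cycles, it can help to compute which tag each source operand will wait on at issue time. The sketch below takes the unit assigned to each instruction from the cycle-by-cycle slides that follow (load buffer 4 for A, load buffer 3 for B, adders 10-12, multiplier/divider 8-9) and assumes no producer has written its result before its consumers issue, which holds in this example. Only dependences are modelled, so the arithmetic operators are omitted; `PROGRAM`, `LOAD_TAGS`, and `source_tags` are hypothetical names.

```python
# (address, destination register or None, sources, tag of the producing unit)
PROGRAM = [
    (100, "F0", [],           4),     # load of A through load buffer 4
    (101, "F0", ["F0", "F1"], 10),
    (102, "F0", ["F0", "B"],  11),
    (103, "F2", ["F2", "F3"], 12),
    (104, "F1", ["F1", "F2"], 8),
    (105, None, ["F1"],       None),  # store to C: writes no register
    (106, "F0", ["F1", "F0"], 9),
]
LOAD_TAGS = {"B": 3}  # memory operand B is fetched through load buffer 3


def source_tags(program):
    """Map each instruction address to the tags of its [left, right] sources.

    Tag 0 means the operand was read directly from a valid register.
    """
    rename = {}   # register -> tag of its pending producer
    result = {}
    for address, dest, sources, tag in program:
        result[address] = [LOAD_TAGS.get(s, rename.get(s, 0)) for s in sources]
        if dest is not None:
            rename[dest] = tag
    return result


for address, tags in source_tags(PROGRAM).items():
    print(address, tags)
```

The computed source tags agree with the per-cycle notes: for example, instruction 102 waits on tags 10 and 3, and instruction 104 reads its left input from the register and waits on tag 12 for its right input.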
10Tomasulo Example Cycle 0
11Tomasulo Example Cycle 1
- (A) will arrive at tag 4
- (F0) will come from tag 4
- F0 is set to busy
12Tomasulo Example Cycle 2
- (F0) will be produced at tag 10
- Right input of adder came from register (tag bit 0)
- Left input of adder will come from tag 4
- Forwarding tag of F0 has been changed from 4 to 10
13Tomasulo Example Cycle 3
- (F0) will be produced at tag 11
- (B) will arrive at tag 3
- Right input of adder will come from tag 3
- Left input of adder will come from tag 10
- (A) arrives from memory
- Forwarding tag of F0 has been changed from 10 to 11
14Tomasulo Example Cycle 4
- (F2) will be produced at tag 12
- Right input of adder came from register (tag bit 0)
- Left input of adder came from register (tag bit 0)
- (A) with tag 4 is broadcast on CDB
- Adder (at tag 10) picks it up, and is thereby enabled
- The instruction that will write F2 has already read the old contents of F2
15Tomasulo Example Cycle 5
- (F1) will be produced at tag 8
- Right input of multiplier will come from tag 12
- Left input of multiplier came from register (tag bit 0)
- Adder (at tag 10) starts computing
- (B) arrives from memory
16Tomasulo Example Cycle 6
- Memory address of destination is C
- Data will come from tag 8
- Adder (at tag 10) finishes computing
- (B) with tag 3 is broadcast on CDB
- Adder (at tag 11) picks it up
- Adder (at tag 12) starts computing
17Tomasulo Example Cycle 7
- (F1) will be produced at tag 9
- Right input of divider will come from tag 11
- Left input of divider will come from tag 8
- Result of adder (with tag 10) is broadcast on CDB
- Adder (at tag 11) picks it up and is thereby enabled
- Adder (at tag 12) finishes computing
18Tomasulo Example Cycle 8
- Result of adder (at tag 12) is broadcast on CDB
- Multiplier (at tag 8) picks it up and is thereby enabled
19Tomasulo Example Cycle 9
- Multiplier (at tag 8) starts computing
- Adder (at tag 11) finishes computing
20Tomasulo Example Cycle 10
- Result of adder (with tag 11) is broadcast on CDB
- Divider (at tag 9) picks it up
- Register F0 picks it up
21Tomasulo Example Cycle 11
- Multiplier (at tag 8) finishes computing
22Tomasulo Example Cycle 12
- Result of multiplier (at tag 8) is broadcast on CDB
- Divider (at tag 9) picks it up, and is thereby enabled
- Store buffer (at tag 1) picks it up, and is thereby enabled
23Observations on Tomasulo's Algorithm
- Instructions move from decoder to reservation stations
  - in program order
  - dependences can be correctly recorded
- Data Flow Graph: The graph of pointers connecting the RS, registers, and memory buffers
  - helps accomplish out-of-order sequencing of instructions
- Chief cost of this scheme: high-speed associative hardware
  - RS hardware has to search for tags when CDB broadcasts some value with its tag
- Full load bypassing is supported
  - load and store buffers are treated just like functional units
  - additional hardware on 360/91 also supported load forwarding
24Tomasulo: Example of Load Bypassing
- Instruction 202 depends on instructions 200 and 201, so instruction 203 will start executing much before 202 (assuming C and D are found to be different memory addresses)
- Work out details off-line
200  F0 ← A
201  F0 ← F0 / F1
202  C ← F0
203  F0 ← D
204  F0 ← F0 F2
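The condition "C and D are found to be different memory addresses" is an address comparison between the load and every store still waiting ahead of it. A minimal sketch, assuming addresses are plain integers and using a hypothetical helper `load_action` (the numeric addresses below are made up for illustration):

```python
def load_action(load_address, pending_store_addresses):
    """Decide whether a load may go ahead of earlier, unfinished stores.

    Returns "bypass" when no pending store targets the same address,
    and "wait" otherwise (load forwarding is not modelled here).
    """
    if load_address in pending_store_addresses:
        return "wait"
    return "bypass"


# Hypothetical addresses standing in for the symbols C and D.
C, D = 0x300, 0x400

# The store to C (instruction 202) is still waiting for the divide result.
print(load_action(D, [C]))   # the load of D (203) does not conflict with it
print(load_action(C, [C]))   # a load of C would have to wait
```

In the example above the load at 203 can therefore start while the store at 202 is still waiting on instruction 201.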
25Tomasulo: Loop Unrolling in Hardware
- 360/91 supported a limited kind of speculation
  - Small loops could be held in a loop buffer
  - Loop-closing branches were predicted as taken
- This has the effect of loop unrolling at run-time
- Given the small number of FP registers in the machine, software loop unrolling was not a viable option
26Tomasulo Loop Example
Loop: L.D     F0, 0(R1)
      MULT.D  F4, F0, F2
      S.D     F4, 0(R1)
      SUBI    R1, R1, 8
      BNEZ    R1, Loop
- Multiply takes 4 clocks
- Loads have cache misses
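The reason this loop unrolls itself in hardware is that each iteration's F0 and F4 are renamed to a different buffer or reservation station. The sketch below issues the floating-point part of the loop body twice and records what each operand is renamed to. It assumes the branch is predicted taken, that the integer instructions (SUBI, BNEZ) are handled elsewhere, and that no result has been written back yet; the unit names (`Load1`, `Mult1`, `Store1`, ...) and the function `unroll` are hypothetical.

```python
from itertools import count

# (opcode, destination register or None, source registers)
LOOP_BODY = [
    ("L.D",    "F0", []),
    ("MULT.D", "F4", ["F0", "F2"]),
    ("S.D",    None, ["F4"]),
]
UNIT_PREFIX = {"L.D": "Load", "MULT.D": "Mult", "S.D": "Store"}


def unroll(iterations):
    """Issue the loop body repeatedly, renaming sources to producer units."""
    rename = {}                                   # register -> producing unit
    next_unit = {op: count(1) for op in UNIT_PREFIX}
    trace = []
    for iteration in range(1, iterations + 1):
        for op, dest, sources in LOOP_BODY:
            unit = f"{UNIT_PREFIX[op]}{next(next_unit[op])}"
            inputs = [rename.get(s, s) for s in sources]  # unrenamed: keep name
            trace.append((iteration, op, unit, inputs))
            if dest is not None:
                rename[dest] = unit
    return trace


for entry in unroll(2):
    print(entry)
```

The second multiply waits on `Load2` rather than `Load1`, and each store waits on the multiply of its own iteration, so the two iterations can overlap even though both name F0 and F4.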
27Loop Example Cycle 0
28Loop Example Cycle 1
29Loop Example Cycle 2
30Loop Example Cycle 3
31Loop Example Cycle 4
32Loop Example Cycle 5
33Loop Example Cycle 6
34Loop Example Cycle 7
35Loop Example Cycle 8
36Loop Example Cycle 9
37Loop Example Cycle 10
38Loop Example Cycle 11
39Loop Example Cycle 12
40Loop Example Cycle 13
41Loop Example Cycle 14
42Loop Example Cycle 15
43Loop Example Cycle 16
44Loop Example Cycle 17
45Loop Example Cycle 18
46Loop Example Cycle 19
47Loop Example Cycle 20
48Loop Example Cycle 21
49Summary of Tomasulo's Algorithm
- Prevents registers from becoming a bottleneck
- Avoids WAR and WAW hazards of scoreboard
- Allows loop unrolling in hardware
- Not limited to basic blocks (provided we have branch prediction)
- Lasting contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- Next: Dynamic branch prediction