Title: ECE6130: Computer Architecture: Instruction Level Parallelism (ILP)
1ECE6130 Computer Architecture: Instruction Level Parallelism (ILP)
- Dr. Xubin He
- http://www.ece.tntech.edu/hexb
- Email: hexb_at_tntech.edu
- Tel: 931-372-3462, Brown Hall 319
2- Previous Class
- Dynamic Scheduling
- Tomasulo
- Today
- Tomasulo (Contd.)
- Speculation
- Speculative Tomasulo
- Multiple instructions issue
3Several Small Questions
- Why is RAW a real hazard?
- What technique allows out-of-order execution?
4Several Small Questions
- Why is RAW a real hazard?
- WAW and WAR can be eliminated by register renaming if we have enough resources
- RAW can only be avoided, not eliminated
- Ex.: A load instruction followed by an integer ALU instruction that directly uses the load result will always lead to a RAW hazard, so the compiler has to try to avoid such situations
- What technique allows out-of-order execution?
- Split the ID pipe stage of the simple 5-stage pipeline into two stages, Issue and Read operands: in-order issue, out-of-order execution, and out-of-order completion
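The three hazard kinds named above can be told apart by comparing the registers each instruction reads and writes. The following is a minimal sketch, not a pipeline model; the `Instr` type, the `hazards` helper, and the register names (including the renamed register `P1`) are assumptions made for illustration.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str
    dest: frozenset   # registers written
    srcs: frozenset   # registers read

def hazards(first, second):
    """Hazard kinds that `second` has with respect to the earlier `first`."""
    found = set()
    if first.dest & second.srcs:
        found.add("RAW")   # second reads what first writes: a true dependence
    if first.srcs & second.dest:
        found.add("WAR")   # second overwrites a register first still reads
    if first.dest & second.dest:
        found.add("WAW")   # both write the same register
    return found

# Load followed by an ALU instruction that uses the load result.
load = Instr("LD R1,0(R2)", frozenset({"R1"}), frozenset({"R2"}))
add = Instr("ADD R3,R1,R4", frozenset({"R3"}), frozenset({"R1", "R4"}))
print(hazards(load, add))  # prints {'RAW'}
```

Giving the later instruction a different destination register makes a WAR or WAW result disappear, while no choice of destination removes the RAW case, since the second instruction needs the value itself.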
5Tomasulo Loop Example
- Loop: LD    F0,0(R1)
-       MULTD F4,F0,F2
-       SD    F4,0(R1)
-       SUBI  R1,R1,#8
-       BNEZ  R1,Loop
- This time assume Multiply takes 4 clocks
- Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
- To be clear, will not show clocks for SUBI, BNEZ
- Reality: integer instructions ahead of floating point instructions
- Show 2 iterations
6Loop Example
7Loop Example Cycle 1
8Loop Example Cycle 2
9Loop Example Cycle 3
- Implicit renaming sets up data flow graph
10Loop Example Cycle 4
- Dispatching SUBI Instruction (not in FP queue)
11Loop Example Cycle 5
- And, BNEZ instruction (not in FP queue)
12Loop Example Cycle 6
- Notice that F0 never sees Load from location 80
13Loop Example Cycle 7
- Register file completely detached from computation
- First and second iterations completely overlapped
14Loop Example Cycle 8
15Loop Example Cycle 9
- Load1 completing: who is waiting?
- Note: dispatching SUBI
16Loop Example Cycle 10
- Load2 completing: who is waiting?
- Note: dispatching BNEZ
17Loop Example Cycle 11
18Loop Example Cycle 12
- Why not issue third multiply?
19Loop Example Cycle 13
- Why not issue third store?
20Loop Example Cycle 14
- Mult1 completing. Who is waiting?
21Loop Example Cycle 15
- Mult2 completing. Who is waiting?
22Loop Example Cycle 16
23Loop Example Cycle 17
24Loop Example Cycle 18
25Loop Example Cycle 19
26Loop Example Cycle 20
- Once again: in-order issue, out-of-order execution, and out-of-order completion.
27Summary
- Reservation stations: implicit register renaming to a larger set of registers, plus buffering of source operands
- Prevents registers from becoming a bottleneck
- Avoids WAR, WAW hazards
- Allows loop unrolling in HW
- Not limited to basic blocks (integer unit gets ahead, beyond branches)
- Today, helps cache misses as well
- Don't stall for L1 data cache miss (insufficient ILP for L2 miss?)
- Lasting contributions
- Dynamic scheduling
- Register renaming
- Load/store disambiguation
- 360/91 descendants are Pentium III, PowerPC 604, MIPS R10000, HP-PA 8000, Alpha 21264
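The renaming idea in this summary can be sketched in a few lines. This is an illustration under stated assumptions, not the 360/91 mechanism itself: the tags `T0`, `T1`, ... are hypothetical stand-ins for reservation station names, and each instruction is reduced to a (destination, source, source) tuple.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh tag; sources read the newest tag."""
    fresh = (f"T{i}" for i in count())
    latest = {}      # architectural register -> tag holding its newest value
    renamed = []
    for dest, *srcs in program:
        # A source uses the latest tag, or the original register if no
        # earlier instruction in this sequence has written it.
        new_srcs = [latest.get(s, s) for s in srcs]
        tag = next(fresh)
        latest[dest] = tag
        renamed.append((tag, *new_srcs))
    return renamed

program = [
    ("F0", "F2", "F4"),    # writes F0
    ("F6", "F0", "F8"),    # RAW on F0
    ("F8", "F10", "F12"),  # WAR on F8 with the previous instruction
    ("F0", "F10", "F14"),  # WAW on F0 with the first instruction
]
for line in rename(program):
    print(line)
```

After renaming, every destination is unique, so the WAR and WAW conflicts are gone, while the second instruction still names `T0` as a source: the RAW dependence survives as a data flow edge.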
28Speculation to greater ILP
- Greater ILP: overcome control dependence by having the hardware speculate on the outcome of branches and execute the program as if the guesses were correct
- Speculation: fetch, issue, and execute instructions as if branch predictions were always correct
- Dynamic scheduling: only fetches and issues instructions
- Essentially a data flow execution model: operations execute as soon as their operands are available
29Speculation to greater ILP
- 3 components of HW-based speculation
- Dynamic branch prediction to choose which instructions to execute
- Speculation to allow execution of instructions before control dependences are resolved, with the ability to undo the effects of an incorrectly speculated sequence
- Dynamic scheduling to deal with scheduling of different combinations of basic blocks
30Adding Speculation to Tomasulo
- Must separate execution from allowing an instruction to finish, or "commit"
- This additional step is called instruction commit
- When an instruction is no longer speculative, allow it to update the register file or memory
- Requires an additional set of buffers to hold results of instructions that have finished execution but have not committed
- This reorder buffer (ROB) is also used to pass results among instructions that may be speculated
31Exceptions and Interrupts
- Speculation: guess and check
- Important for branch prediction
- Need to take our best shot at predicting branch direction.
- If we speculate and are wrong, need to back up and restart execution at the point where we predicted incorrectly
- This is exactly the same as precise exceptions!
- Technique for both precise interrupts/exceptions and speculation: in-order commit
- Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB
- If a speculated instruction raises an exception, the exception is recorded in the ROB
- This is why reorder buffers are in all new processors
32Reorder Buffer (ROB)
- In Tomasulo's algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file
- With speculation, the register file is not updated until the instruction commits
- (we know definitively that the instruction should execute)
- Thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit
- ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo's algorithm
- ROB extends the architectural registers, like RS
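How an issuing instruction might find a source operand when the ROB sits between execution and commit can be sketched as follows. The `reg_status` map (register to pending ROB entry number) and the dictionary layout of a ROB entry are assumptions for illustration, not structures defined in these slides.

```python
def read_operand(reg, reg_status, rob, regfile):
    """Return ("value", v) if the operand is available, else ("wait", rob_no)."""
    rob_no = reg_status.get(reg)          # ROB entry that will write reg, if any
    if rob_no is None:
        return ("value", regfile[reg])    # no pending writer: register file is current
    entry = rob[rob_no]
    if entry["ready"]:
        return ("value", entry["value"])  # executed but not committed: read the ROB
    return ("wait", rob_no)               # still executing: wait for this tag

regfile = {"F0": 1.0, "F4": 2.5}
rob = {1: {"dest": "F0", "value": None, "ready": False}}
reg_status = {"F0": 1}                    # ROB entry 1 will produce F0

print(read_operand("F4", reg_status, rob, regfile))  # ('value', 2.5)
print(read_operand("F0", reg_status, rob, regfile))  # ('wait', 1)
```

Once entry 1 is marked ready, the same call returns its value from the ROB even though `regfile["F0"]` still holds the old committed value.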
33Reorder Buffer Entry
- Each entry in the ROB contains four fields
- Instruction type: a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
- Destination: register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written
- Value: value of the instruction result until the instruction commits
- Ready: indicates that the instruction has completed execution, and the value is ready
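The four fields can be written down as a small record type. This is only a sketch: the class name `ROBEntry`, the field names, and the string encoding of the instruction type are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    instr_type: str                         # "branch", "store", or "register"
    destination: Optional[Union[str, int]]  # register name, memory address, or None for a branch
    value: Optional[float] = None           # result held here until commit
    ready: bool = False                     # True once execution has completed

# A load into F0 is allocated at issue and filled in when it finishes.
load = ROBEntry("register", "F0")
load.value, load.ready = 3.5, True

# A branch has no destination result.
branch = ROBEntry("branch", None)
```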
34Reorder Buffer operation
- Holds instructions in FIFO order, exactly as issued
- When instructions complete, results are placed into the ROB
- Supplies operands to other instructions between execution complete and commit: more registers, like RS
- Tag results with ROB buffer number instead of reservation station
- Instructions commit: values at head of ROB are placed in registers
- As a result, easy to undo speculated instructions on mispredicted branches or on exceptions
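The FIFO behaviour described on this slide can be sketched with a double-ended queue. This is a simplified model under stated assumptions: tags are plain increasing integers with no wraparound, every entry writes a register, and the class and method names are hypothetical.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # head (oldest) on the left
        self.next_tag = 1

    def issue(self, dest):
        """Allocate an entry at the tail; the returned tag names the result."""
        entry = {"tag": self.next_tag, "dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        self.next_tag += 1
        return entry["tag"]

    def write_result(self, tag, value):
        """Execution finished, possibly out of order: record the result."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["ready"] = value, True

    def commit(self, regfile):
        """Retire ready entries from the head only, so commit stays in order."""
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]

    def flush_after(self, tag):
        """Misprediction at tag: discard every younger entry."""
        while self.entries and self.entries[-1]["tag"] > tag:
            self.entries.pop()

rob, regs = ReorderBuffer(), {}
t1 = rob.issue("F0")
t2 = rob.issue("F10")
rob.write_result(t2, 9.0)    # the younger instruction finishes first
rob.commit(regs)             # nothing retires: the head is not ready
before = dict(regs)
rob.write_result(t1, 4.0)
rob.commit(regs)             # now both retire, oldest first
```

Because registers change only in `commit`, undoing speculation reduces to dropping tail entries, which is what `flush_after` does.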
35Recall 4 Steps of Speculative Tomasulo Algorithm
- 1. Issue: get instruction from FP Op Queue
- If reservation station and reorder buffer slot free, issue instr and send operands and reorder buffer no. for destination (this stage sometimes called "dispatch")
- 2. Execution: operate on operands (EX)
- When both operands ready, then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")
- 3. Write result: finish execution (WB)
- Write on Common Data Bus to all awaiting FUs and reorder buffer; mark reservation station available.
- 4. Commit: update register with reorder result
- When instr. is at head of reorder buffer and result is present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called "graduation")
36Tomasulo With Reorder buffer
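- Figure: Tomasulo datapath with a reorder buffer: FP Op Queue, Reorder Buffer (ROB7 newest to ROB1 oldest, each entry with Dest and Done? fields), Registers, Reservation Stations feeding the FP adders and FP multipliers, and paths to and from memory. ROB1 (oldest) holds LD F0,10(R2) with Dest F0 and Done? N.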
37Tomasulo With Reorder buffer
- Figure: Tomasulo datapath with a reorder buffer (FP Op Queue, Reorder Buffer ROB7 newest to ROB1 oldest, Registers, Reservation Stations, FP adders, FP multipliers, memory paths). Reservation station entry shown: Dest 2, ADDD R(F4),ROB1.
38Tomasulo With Reorder buffer
- Figure: Tomasulo datapath with a reorder buffer (FP Op Queue, Reorder Buffer ROB7 newest to ROB1 oldest, Registers, Reservation Stations, FP adders, FP multipliers, memory paths). Reservation station entry shown: Dest 2, ADDD R(F4),ROB1.
39Tomasulo With Reorder buffer
- Figure: Tomasulo datapath with a reorder buffer (FP Op Queue, Reorder Buffer ROB7 newest to ROB1 oldest, Registers, Reservation Stations, FP adders, FP multipliers, memory paths). Reservation station entries shown: Dest 2, ADDD R(F4),ROB1 and Dest 6, ADDD ROB5,R(F6). Pending loads shown: Dest 1, 10+R2 and Dest 5, 0+R3.
40Tomasulo With Reorder buffer
- Figure: Tomasulo datapath with a reorder buffer (FP Op Queue, Reorder Buffer ROB7 newest to ROB1 oldest, Registers, Reservation Stations, FP adders, FP multipliers, memory paths). Reservation station entries shown: Dest 2, ADDD R(F4),ROB1 and Dest 6, ADDD ROB5,R(F6). Pending loads shown: Dest 1, 10+R2 and Dest 5, 0+R3.
41Tomasulo With Reorder buffer
- Figure: Tomasulo datapath with a reorder buffer (FP Op Queue, Reorder Buffer ROB7 newest to ROB1 oldest, Registers, Reservation Stations, FP adders, FP multipliers, memory paths). Reservation station entries shown: Dest 2, ADDD R(F4),ROB1 and Dest 6, ADDD M10,R(F6).
42Tomasulo With Reorder buffer
- Figure: Tomasulo datapath with a reorder buffer (FP Op Queue, Reorder Buffer ROB7 newest to ROB1 oldest, Registers, Reservation Stations, FP adders, FP multipliers, memory paths). Reservation station entry shown: Dest 2, ADDD R(F4),ROB1.
43Tomasulo With Reorder buffer
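- Figure: Tomasulo datapath with a reorder buffer (FP Op Queue, Reorder Buffer ROB7 newest to ROB1 oldest, Registers, Reservation Stations, FP adders, FP multipliers, memory paths). Reorder buffer entries shown, oldest first: LD F0,10(R2) (Dest F0, Done? N), ADDD F10,F4,F0 (Dest F10, Done? N), DIVD F2,F10,F6 (Dest F2, Done? N). Reservation station entry shown: Dest 2, ADDD R(F4),ROB1.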
44Avoiding Memory Hazards
- WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at the head of the ROB, and hence, no earlier loads or stores can still be pending
- RAW dependences through memory are preserved by two restrictions
- not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
- maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.
- These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data
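The two restrictions can be read as a check made before a load accesses memory. The sketch below is one interpretation, not the textbook's hardware: the ROB is modelled as a list in program order, a store whose effective address has not been computed carries `addr` of `None`, and the function name is hypothetical.

```python
def load_may_access_memory(load_pos, load_addr, rob):
    """rob: active (uncommitted) entries in program order."""
    for entry in rob[:load_pos]:           # only entries older than the load
        if entry["type"] != "store":
            continue
        if entry["addr"] is None:          # earlier store address not computed yet
            return False
        if entry["addr"] == load_addr:     # earlier store to the same address is pending
            return False
    return True

rob = [
    {"type": "store", "addr": 100},
    {"type": "register", "addr": None},
    {"type": "load", "addr": 100},
    {"type": "load", "addr": 200},
]
print(load_may_access_memory(2, 100, rob))  # False: must wait for the store to 100
print(load_may_access_memory(3, 200, rob))  # True: no earlier store to 200
```

Once the store commits and leaves the list of active entries, the first load passes the check as well.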
45Getting CPI below 1
- CPI >= 1 if we issue only 1 instruction every clock cycle
- Multiple-issue processors come in 3 flavors
- statically-scheduled superscalar processors,
- dynamically-scheduled superscalar processors, and
- VLIW (very long instruction word) processors
- The 2 types of superscalar processors issue varying numbers of instructions per clock
- use in-order execution if they are statically scheduled, or
- out-of-order execution if they are dynamically scheduled
- VLIW processors, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)
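The bound on CPI for single issue, and why multiple issue lowers it, can be written out as a short worked relation. The issue width w is a symbol introduced here for illustration; the stall term stands for all structural, data hazard, and control stalls combined.

```latex
\text{CPI}_{\text{ideal}} = \frac{1}{w},
\qquad
\text{CPI} = \text{CPI}_{\text{ideal}} + \text{stall cycles per instruction}
```

With w = 1 the ideal CPI is 1, so the actual CPI cannot fall below 1. With w = 4 the ideal CPI is 0.25, and the CPI actually achieved depends on how many stall cycles remain.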
46Multiple Issue Processors
- Taking Advantage of More ILP with Multiple Issue
- Superscalar: issue varying numbers of instructions per cycle that are either statically scheduled (using compiler techniques, thus in-order execution) or dynamically scheduled (using techniques based on Tomasulo's algorithm, thus out-of-order execution)
- VLIW (very long instruction word): issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (hence, they are also known as EPIC, explicitly parallel instruction computers). VLIW and EPIC processors are inherently statically scheduled by the compiler.
47Limits of ILP
- Paper: "Limits of instruction-level parallelism," by David Wall, Nov 1993
- What were the defaults for the number of instructions issued per clock cycle, instruction window size, execution latency, and number of execution units?
- How did loop unrolling change results?
- How did realistic functional unit execution latencies change results?
- Paper was written in 1993
- Which ideas are still too optimistic in 2008?
- Which ideas seem tame in 2008?
48Next
- Memory Hierarchy
- Read Chapter 5.1-5.3