ECE6130: Computer Architecture: Instruction Level Parallelism ILP - PowerPoint PPT Presentation

Slides: 49
Provided by: xubi

1
ECE6130 Computer Architecture: Instruction Level
Parallelism (ILP)
  • Dr. Xubin He
  • http://www.ece.tntech.edu/hexb
  • Email: hexb_at_tntech.edu
  • Tel: 931-3723462, Brown Hall 319

2
  • Previous Class
    • Dynamic Scheduling
    • Tomasulo
  • Today
    • Tomasulo (Contd.)
    • Speculation
    • Speculative Tomasulo
    • Multiple-instruction issue

3
Several Small Questions
  • Why is RAW a real hazard?
  • What technique allows out-of-order execution?

4
Several Small Questions
  • Why is RAW a real hazard?
    • WAW and WAR can be eliminated by register
      renaming if we have enough resources
    • RAW can only be avoided, not eliminated
    • Example: a load instruction followed by an integer
      ALU instruction that directly uses the load
      result will always lead to a RAW hazard, so the
      compiler has to try to avoid such situations
  • What technique allows out-of-order execution?
    • Split the ID pipe stage of the simple 5-stage
      pipeline into two stages, Issue and Read Operands.
      This gives in-order issue, out-of-order execution,
      and out-of-order completion
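
The renaming point above can be illustrated with a minimal Python sketch. It is not the hardware mechanism from the lecture: the `rename` function, the `P0`, `P1`, ... physical register names, and the four-instruction program are all hypothetical, chosen only to show that fresh destination names remove WAR and WAW conflicts while RAW dependences remain.

```python
def rename(program):
    """Give every write a fresh physical register (hypothetical P0, P1, ...).

    program: list of (dest, src1, src2) architectural register names.
    Returns the renamed list of (dest, src1, src2) tuples.
    """
    mapping = {}      # architectural name -> most recent physical name
    next_phys = 0
    renamed = []
    for dest, src1, src2 in program:
        # Sources read the latest mapping, so true (RAW) dependences survive.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # A fresh name per write removes WAR and WAW name conflicts.
        phys = f"P{next_phys}"
        next_phys += 1
        mapping[dest] = phys
        renamed.append((phys, s1, s2))
    return renamed


program = [
    ("F0", "F2", "F4"),    # I1: writes F0
    ("F6", "F0", "F8"),    # I2: reads F0 (RAW on I1), reads F8
    ("F8", "F10", "F12"),  # I3: writes F8 (WAR with I2)
    ("F6", "F10", "F8"),   # I4: writes F6 (WAW with I2), reads F8 (RAW on I3)
]
renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, I3 and I4 no longer share a destination name with anything I2 reads or writes, but I2 still has to wait for I1's result and I4 for I3's.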

5
Tomasulo Loop Example
  • Loop: LD F0, 0(R1)
  • MULTD F4, F0, F2
  • SD F4, 0(R1)
  • SUBI R1, R1, #8
  • BNEZ R1, Loop
  • This time assume multiply takes 4 clocks
  • Assume the 1st load takes 8 clocks (L1 cache miss) and
    the 2nd load takes 1 clock (hit)
  • To be clear, will not show clocks for SUBI, BNEZ
  • In reality, integer instructions run ahead of
    floating-point instructions
  • Show 2 iterations

6
Loop Example
7
Loop Example Cycle 1
8
Loop Example Cycle 2
9
Loop Example Cycle 3
  • Implicit renaming sets up data flow graph

10
Loop Example Cycle 4
  • Dispatching SUBI Instruction (not in FP queue)

11
Loop Example Cycle 5
  • And, BNEZ instruction (not in FP queue)

12
Loop Example Cycle 6
  • Notice that F0 never sees Load from location 80

13
Loop Example Cycle 7
  • Register file completely detached from
    computation
  • First and Second iteration completely overlapped

14
Loop Example Cycle 8
15
Loop Example Cycle 9
  • Load1 completing: who is waiting?
  • Note: dispatching SUBI

16
Loop Example Cycle 10
  • Load2 completing: who is waiting?
  • Note: dispatching BNEZ

17
Loop Example Cycle 11
  • Next load in sequence

18
Loop Example Cycle 12
  • Why not issue third multiply?

19
Loop Example Cycle 13
  • Why not issue third store?

20
Loop Example Cycle 14
  • Mult1 completing. Who is waiting?

21
Loop Example Cycle 15
  • Mult2 completing. Who is waiting?

22
Loop Example Cycle 16
23
Loop Example Cycle 17
24
Loop Example Cycle 18
25
Loop Example Cycle 19
26
Loop Example Cycle 20
  • Once again: in-order issue, out-of-order
    execution, and out-of-order completion.

27
Summary
  • Reservation stations: implicit register renaming
    to a larger set of registers, plus buffering of
    source operands
    • Prevents registers from becoming a bottleneck
    • Avoids WAR and WAW hazards
    • Allows loop unrolling in HW
  • Not limited to basic blocks (the integer unit gets
    ahead, beyond branches)
  • Today, helps with cache misses as well
    • Don't stall for an L1 data cache miss (insufficient
      ILP for an L2 miss?)
  • Lasting contributions
    • Dynamic scheduling
    • Register renaming
    • Load/store disambiguation
  • 360/91 descendants are the Pentium III, PowerPC 604,
    MIPS R10000, HP-PA 8000, and Alpha 21264

28
Speculation to greater ILP
  • Greater ILP: overcome control dependence by
    hardware speculating on the outcome of branches and
    executing the program as if the guesses were correct
    • Speculation: fetch, issue, and execute
      instructions as if branch predictions were always
      correct
    • Dynamic scheduling alone: only fetches and issues
      such instructions
  • Essentially a data flow execution model:
    operations execute as soon as their operands are
    available
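
The data flow idea can be sketched in a few lines of Python. This is an idealized model, not the cycle-by-cycle Tomasulo walkthrough from the loop example: it assumes unlimited functional units, ignores issue and bus constraints, and leaves out the stores because the slides give no store latency. The function name and the operation labels are hypothetical. The latencies (first load 8 clocks, second load 1 clock, multiply 4 clocks) are the ones assumed in the loop example.

```python
def dataflow_finish_times(ops):
    """Earliest finish time of each operation under pure data flow.

    ops: dict mapping name -> (latency, [producer names]), listed so that
    every producer appears before its consumers.
    """
    finish = {}
    for name, (latency, producers) in ops.items():
        # An operation starts as soon as its last operand is available.
        start = max((finish[p] for p in producers), default=0)
        finish[name] = start + latency
    return finish


ops = {
    "LD1": (8, []),            # first iteration load, cache miss
    "MULTD1": (4, ["LD1"]),    # needs the first load's result
    "LD2": (1, []),            # second iteration load, cache hit
    "MULTD2": (4, ["LD2"]),    # needs the second load's result
}
finish = dataflow_finish_times(ops)
print(finish)
```

In this idealized model the second iteration's multiply finishes before the first iteration's, which is the out-of-order completion that a data flow model permits. The cycle numbers will not match the slides, which also model in-order issue and shared hardware.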

29
Speculation to greater ILP
  • 3 components of HW-based speculation
    • Dynamic branch prediction to choose which
      instructions to execute
    • Speculation to allow execution of instructions
      before control dependences are resolved, plus the
      ability to undo the effects of an incorrectly
      speculated sequence
    • Dynamic scheduling to deal with scheduling of
      different combinations of basic blocks

30
Adding Speculation to Tomasulo
  • Must separate execution from allowing an
    instruction to finish or commit
  • This additional step is called instruction commit
  • When an instruction is no longer speculative,
    allow it to update the register file or memory
  • Requires an additional set of buffers to hold
    results of instructions that have finished
    execution but have not committed
  • This reorder buffer (ROB) is also used to pass
    results among instructions that may be speculated

31
Exceptions and Interrupts
  • Speculation: guess and check
  • Important for branch prediction
    • Need to take our best shot at predicting branch
      direction
  • If we speculate and are wrong, need to back up
    and restart execution at the point where we
    predicted incorrectly
    • This is exactly the same as precise exceptions!
  • Technique for both precise interrupts/exceptions
    and speculation: in-order commit
  • Exceptions are handled by not recognizing the
    exception until the instruction that caused it is
    ready to commit in the ROB
    • If a speculated instruction raises an exception,
      the exception is recorded in the ROB
    • This is why reorder buffers are in all new
      processors
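
A minimal Python sketch of the "record now, recognize at commit" idea follows. The `commit_head` function, the dictionary layout, and the `exception` key are assumptions made for illustration. The slides only say that the exception is recorded in the ROB and not recognized until the instruction is ready to commit.

```python
def commit_head(rob, regfile):
    """Try to commit the oldest ROB entry; rob is a list, oldest first."""
    if not rob:
        return "empty"
    head = rob[0]
    if not head["ready"]:
        return "wait"                      # in-order commit: nothing passes the head
    if head["exception"] is not None:
        rob.clear()                        # drop the faulting instruction and all younger ones
        return "exception: " + head["exception"]
    regfile[head["dest"]] = head["value"]  # architectural state changes only here
    rob.pop(0)
    return "committed"


rob = [
    {"dest": "F0", "value": 1.0, "ready": True, "exception": None},
    {"dest": "F2", "value": None, "ready": True, "exception": "divide by zero"},
    {"dest": "F4", "value": 9.0, "ready": True, "exception": None},
]
regs = {}
first = commit_head(rob, regs)
second = commit_head(rob, regs)
print(first, "|", second, "|", regs)
```

Although the third instruction has already finished, its result never reaches the register file, so the state seen by the handler reflects exactly the instructions before the faulting one.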

32
Reorder Buffer (ROB)
  • In Tomasulo's algorithm, once an instruction
    writes its result, any subsequently issued
    instructions will find the result in the register
    file
  • With speculation, the register file is not
    updated until the instruction commits
    • (we know definitively that the instruction should
      execute)
  • Thus, the ROB supplies operands in the interval
    between completion of instruction execution and
    instruction commit
    • The ROB is a source of operands for instructions,
      just as reservation stations (RS) provide
      operands in Tomasulo's algorithm
    • The ROB extends the architected registers, like
      the RS

33
Reorder Buffer Entry
  • Each entry in the ROB contains four fields
  • Instruction type
    • A branch (has no destination result), a store
      (has a memory address destination), or a register
      operation (ALU operation or load, which has a
      register destination)
  • Destination
    • Register number (for loads and ALU operations) or
      memory address (for stores) where the
      instruction result should be written
  • Value
    • Value of the instruction result until the
      instruction commits
  • Ready
    • Indicates that the instruction has completed
      execution and the value is ready
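
The four fields map naturally onto a small record type. The sketch below is one possible Python representation. The class name, field names, and types are assumptions, not a definition from the lecture.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ROBEntry:
    # "branch", "store", or "register" (ALU operation or load)
    instr_type: str
    # Register name for loads/ALU ops, memory address for stores, None for branches
    destination: Optional[Union[str, int]] = None
    # Result held here until the instruction commits
    value: Optional[float] = None
    # True once execution has completed and the value is valid
    ready: bool = False


entry = ROBEntry(instr_type="register", destination="F0")
entry.value, entry.ready = 3.5, True   # result arrives; 3.5 is an arbitrary example value
branch = ROBEntry(instr_type="branch")
print(entry)
print(branch)
```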

34
Reorder Buffer operation
  • Holds instructions in FIFO order, exactly as
    issued
  • When instructions complete, results are placed into
    the ROB
    • Supplies operands to other instructions between
      execution complete and commit: more registers,
      like the RS
    • Tag results with the ROB buffer number instead of
      the reservation station
  • Instructions commit: values at the head of the ROB
    are placed in registers
  • As a result, it is easy to undo speculated
    instructions on mispredicted branches or on
    exceptions
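
The FIFO behaviour described above can be sketched as a small Python class. This is a simplified model under stated assumptions: the class and method names are hypothetical, only register destinations are handled, and `commit` retires every ready entry at the head in one call, whereas real hardware limits commits per cycle.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: FIFO of entries, results tagged by ROB number."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()   # entries[0] is the head (oldest)
        self.next_tag = 1

    def issue(self, dest):
        """Allocate an entry in program order; return its tag, or None if full."""
        if len(self.entries) == self.size:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "ready": False})
        return tag

    def write_result(self, tag, value):
        """Record a completed result; completion may happen out of order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["ready"] = value, True

    def commit(self, regfile):
        """Retire ready entries from the head only, updating the registers."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed

    def flush(self):
        """Undo speculation: discard every uncommitted entry."""
        self.entries.clear()


rob = ReorderBuffer(size=4)
regs = {}
t1 = rob.issue("F0")
t2 = rob.issue("F10")
rob.write_result(t2, 2.0)          # the younger instruction finishes first
early = rob.commit(regs)           # head is not ready, so nothing retires
regs_after_early = dict(regs)
rob.write_result(t1, 1.0)
late = rob.commit(regs)            # both retire, in program order
print(early, late, regs)
```

Because registers change only in `commit`, undoing speculation reduces to `flush`: nothing in the register file has to be repaired.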

35
Recall 4 Steps of Speculative Tomasulo Algorithm
  • 1. Issue: get instruction from FP Op Queue
    • If a reservation station and a reorder buffer slot
      are free, issue the instruction and send the
      operands and the reorder buffer number for the
      destination (this stage is sometimes called
      "dispatch")
  • 2. Execution: operate on operands (EX)
    • When both operands are ready, execute; if not
      ready, watch the CDB for the result. Executing
      only when both operands are in the reservation
      station checks RAW (this stage is sometimes
      called "issue")
  • 3. Write result: finish execution (WB)
    • Write on the Common Data Bus to all awaiting FUs
      and the reorder buffer; mark the reservation
      station available
  • 4. Commit: update register with reorder result
    • When the instruction at the head of the reorder
      buffer has its result present, update the register
      with the result (or store to memory) and remove
      the instruction from the reorder buffer. A
      mispredicted branch flushes the reorder buffer
      (this stage is sometimes called "graduation")
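
The operand handling in steps 1 and 2 can be sketched as a lookup that returns either a value or the ROB number to wait for on the CDB. The `read_operand` function, the `reg_status` table, and the register values are assumptions for illustration. The instruction pair (LD F0,10(R2) in ROB1, followed by ADDD F10,F4,F0) is taken from the diagram slides that follow.

```python
def read_operand(reg, regfile, reg_status, rob):
    """Return ("value", v) if the operand is available, else ("tag", rob_number).

    reg_status maps a register to the ROB number that will write it, if any.
    """
    tag = reg_status.get(reg)
    if tag is None:
        return ("value", regfile[reg])       # no pending writer: read the register file
    entry = rob[tag]
    if entry["ready"]:
        return ("value", entry["value"])     # finished but not committed: read the ROB
    return ("tag", tag)                      # still executing: wait for this tag on the CDB


regfile = {"F0": 0.0, "F4": 4.0}             # arbitrary example values
reg_status = {"F0": 1}                       # ROB1 (the load) will write F0
rob = {1: {"ready": False, "value": None}}

# ADDD F10,F4,F0 at issue: F4 comes from the registers, F0 waits on ROB1.
src1 = read_operand("F4", regfile, reg_status, rob)
src2 = read_operand("F0", regfile, reg_status, rob)
print(src1, src2)

# After the load writes its result, a later reader of F0 gets the value from the ROB.
rob[1].update(ready=True, value=7.0)
src3 = read_operand("F0", regfile, reg_status, rob)
print(src3)
```

The first two results correspond to the reservation station contents shown in the diagrams as "ADDD R(F4),ROB1".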

36
Tomasulo With Reorder buffer
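[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reorder buffer, oldest entry: LD F0,10(R2); Dest F0; Done? N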
37
Tomasulo With Reorder buffer
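[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reservation station entry: Dest 2, ADDD R(F4),ROB1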
38
Tomasulo With Reorder buffer
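[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reservation station entry: Dest 2, ADDD R(F4),ROB1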
39
Tomasulo With Reorder buffer
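[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reservation station entries: Dest 2, ADDD R(F4),ROB1; Dest 6, ADDD ROB5,R(F6)
  • Memory address entries: Dest 1, 10+R2; Dest 5, 0+R3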
40
Tomasulo With Reorder buffer
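[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reservation station entries: Dest 2, ADDD R(F4),ROB1; Dest 6, ADDD ROB5,R(F6)
  • Memory address entries: Dest 1, 10+R2; Dest 5, 0+R3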
41
Tomasulo With Reorder buffer
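[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reservation station entries: Dest 2, ADDD R(F4),ROB1; Dest 6, ADDD M10,R(F6)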
42
Tomasulo With Reorder buffer
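[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reservation station entry: Dest 2, ADDD R(F4),ROB1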
43
Tomasulo With Reorder buffer
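[Figure: Tomasulo datapath with reorder buffer: FP Op Queue, Reorder Buffer (ROB1 oldest through ROB7 newest, with a Done? column), Registers, Reservation Stations, FP adders, FP multipliers, and paths to and from memory, each with Dest tags]
  • Reorder buffer entries, oldest first:
    • LD F0,10(R2); Dest F0; Done? N
    • ADDD F10,F4,F0; Dest F10; Done? N
    • DIVD F2,F10,F6; Dest F2; Done? N
  • Reservation station entry: Dest 2, ADDD R(F4),ROB1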
44
Avoiding Memory Hazards
  • WAW and WAR hazards through memory are eliminated
    with speculation because the actual updating of
    memory occurs in order, when a store is at the head
    of the ROB, and hence no earlier loads or stores
    can still be pending
  • RAW hazards through memory are maintained by two
    restrictions
    • Not allowing a load to initiate the second step
      of its execution if any active ROB entry occupied
      by a store has a Destination field that matches
      the value of the A field of the load, and
    • Maintaining the program order for the computation
      of the effective address of a load with respect to
      all earlier stores
  • These restrictions ensure that any load that
    accesses a memory location written to by an
    earlier store cannot perform the memory access
    until the store has written the data
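
The two restrictions can be sketched as a single check over the ROB entries older than the load. The function name, the list-of-dicts layout, and the use of `None` for a store address that has not been computed yet are assumptions for illustration. A committed store is modeled by removing it from the list.

```python
def load_may_access_memory(load_index, load_addr, rob):
    """Decide whether the load at rob[load_index] may start its memory access.

    rob is a list of entries in program order, oldest first.
    """
    for entry in rob[:load_index]:
        if entry["type"] != "store":
            continue
        # Restriction 2: an earlier store whose address is still unknown blocks the load.
        if entry["addr"] is None:
            return False
        # Restriction 1: an earlier, uncommitted store to the same address blocks the load.
        if entry["addr"] == load_addr:
            return False
    return True


rob = [
    {"type": "store", "addr": 100},
    {"type": "register", "addr": None},
    {"type": "load", "addr": 100},
]
blocked = load_may_access_memory(2, 100, rob)      # same address as the pending store
other = load_may_access_memory(2, 200, rob)        # a different address would be fine
rob_after_commit = rob[1:]                         # the store has written memory and left the ROB
released = load_may_access_memory(1, 100, rob_after_commit)
print(blocked, other, released)
```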

45
Getting CPI below 1
  • CPI ≥ 1 if we issue only 1 instruction every clock
    cycle
  • Multiple-issue processors come in 3 flavors
    • statically scheduled superscalar processors,
    • dynamically scheduled superscalar processors, and
    • VLIW (very long instruction word) processors
  • The 2 types of superscalar processors issue varying
    numbers of instructions per clock
    • use in-order execution if they are statically
      scheduled, or
    • out-of-order execution if they are dynamically
      scheduled
  • VLIW processors, in contrast, issue a fixed
    number of instructions formatted either as one
    large instruction or as a fixed instruction
    packet with the parallelism among instructions
    explicitly indicated by the instruction (Intel/HP
    Itanium)
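
The arithmetic behind "CPI below 1" is short enough to sketch. The function names and the stall figure in the example are hypothetical; the only claim taken from the slide is that a single-issue machine cannot do better than one cycle per instruction.

```python
def ideal_cpi(issue_width):
    """Best-case cycles per instruction when issue_width instructions issue every cycle."""
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    return 1.0 / issue_width


def effective_cpi(issue_width, stall_cycles_per_instruction):
    """Ideal CPI plus an assumed average number of stall cycles per instruction."""
    return ideal_cpi(issue_width) + stall_cycles_per_instruction


print(ideal_cpi(1))            # single issue: the floor is one cycle per instruction
print(ideal_cpi(4))            # 4-wide issue: the floor drops below 1
print(effective_cpi(4, 0.5))   # hazards and misses raise it again
```

Wider issue only lowers the floor. How close a real machine gets depends on how much ILP the hardware or compiler can find.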

46
Multiple Issue Processors
  • Taking advantage of more ILP with multiple issue
  • Superscalar: issues varying numbers of
    instructions per cycle that are either statically
    scheduled (using compiler techniques, thus
    in-order execution) or dynamically scheduled
    (using techniques based on Tomasulo's algorithm,
    thus out-of-order execution)
  • VLIW (very long instruction word): issues a fixed
    number of instructions formatted either as one
    large instruction or as a fixed instruction
    packet with the parallelism among instructions
    explicitly indicated by the instruction (hence,
    they are also known as EPIC, explicitly parallel
    instruction computers). VLIW and EPIC processors
    are inherently statically scheduled by the
    compiler.

47
Limits of ILP
  • Paper: "Limits of instruction-level parallelism,"
    by David Wall, Nov 1993
    • What were the defaults for number of instructions
      issued per clock cycle, instruction window size,
      execution latency, and number of execution units?
    • How did loop unrolling change results?
    • How did realistic functional unit execution
      latencies change results?
  • Paper was written in 1993
    • Which ideas are still too optimistic in 2008?
    • Which ideas seem tame in 2008?

48
Next
  • Memory Hierarchy
  • Read Chapter 5.1-5.3