CSE 7381 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: CSE 7381


1
Lecture 7: More ILP with Multiple Issue and
Speculation
  • Prof. Fatih Koçan
  • CSE 7/5381 Computer Architecture
  • Fall 2002

2
Getting CPI < 1: Issuing Multiple
Instructions/Cycle
  • Vector Processing: Explicit coding of independent
    loops as operations on large vectors of numbers
  • Multimedia instructions being added to many
    processors
  • Superscalar: varying no. instructions/cycle (1 to
    8), scheduled by compiler or by HW (Tomasulo)
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium
    III/4
  • (Very) Long Instruction Words (V)LIW: fixed
    number of instructions (4-16) scheduled by the
    compiler; put ops into wide templates
  • Intel Architecture-64 (IA-64): 64-bit address
  • Renamed: Explicitly Parallel Instruction
    Computer (EPIC)
  • Anticipated success of multiple instructions led
    to Instructions Per Clock cycle (IPC) vs. CPI

3
Getting CPI < 1: Superscalar Processors
  • Issue varying number of instructions per clock
    (1-8)
  • Statically scheduled by compiler
  • In-order execution
  • Dynamically scheduled by hardware
  • Use techniques based on Tomasulo's Algorithm
  • Out-of-order execution

4
VLIW: Very Long Instruction Word
  • Issue fixed number of instructions
  • Two Formats
  • One large instruction
  • A fixed instruction packet
  • The parallelism among instructions is explicitly
    indicated by the instruction
  • EPIC: explicitly parallel instruction computers
  • EPIC and VLIW processors are statically scheduled by
    the compiler

5
Multi-issue Approaches
6
Statically Scheduled Superscalar Processors
  • HW might issue (0-8) instructions/cycle
  • In-order issue
  • Arbitrary K-issue
  • any combination of K instructions in any order
  • Non arbitrary K-issue
  • e.g. K/2 integer, K/2 float instructions
  • All pipeline hazards are checked for at the issue
    stage
  • Check for the hazards
  • Among the instructions being issued (IS), and between
    IS and those already in execution (IE)

7
The Process of Instruction Issue
  • K-issue, dynamically scheduled superscalar
    processor

Issue Packet: 0 ≤ I ≤ K
Pipeline stages (figure labels): IPreF, IF, IS1, IS2, EX
  • IPreF: Prefetches instructions for the superscalar
  • IF: Conceptually, IF examines each instruction in
    the Issue Packet for hazards in program order
  • IS1: Decides how many instructions from the packet
    can be issued simultaneously
  • IS2: Examines the instructions selected in IS1
    against already issued instructions for hazards

8
ISSUE Stage
  • Complex, determines the pipeline cycle time
  • ISSUE stage is pipelined to issue instructions
    every cycle
  • Many statically scheduled and all dynamically
    scheduled superscalars have pipelined issue stage
  • Higher branch penalties
  • Increase issue rate → further pipeline the IS stage
  • (not easy!)
  • Limitation on clock rate of superscalars

9
A Statically Scheduled Superscalar MIPS
  • Issue 2 instructions/cycle: 1 FP + 1 Anything
  • 1 Anything: LD, LDD, SD, SDD, BR, Int ALU, FP
    Move
  • Fetch 64 bits/cycle
  • Can only issue 2nd instruction if 1st instruction
    issues → in-order issue
  • HP 7100, Desktops
  • Arbitrary Dual issue
  • Any combination of two instructions
  • Embedded processors

10
Statically Scheduled Superscalar MIPS
  • Superscalar MIPS: 2 instructions per cycle
  • Fetch 64 bits/clock cycle: <INT, FP> or
    <FP, INT>
  • More ports for FP registers to do FP load and FP
    op in a pair

  Type              Pipe Stages
  Int. instruction  IF  ID  EX  MEM WB
  FP instruction    IF  ID  EX  EX  EX  WB
  Int. instruction      IF  ID  EX  MEM WB
  FP instruction        IF  ID  EX  EX  EX  WB
  Int. instruction          IF  ID  EX  MEM WB
  FP instruction            IF  ID  EX  EX  EX  WB

  • 1-cycle load delay expands to 3 instructions in
    SS
  • instruction in right half can't use it, nor
    instructions in next slot
11
Different Issue Combinations
  Type              Pipe Stages
  FP instruction    IF  ID  EX  EX  EX  WB
  Int. instruction  IF  ID  EX  MEM WB
  FP instruction        IF  ID  EX  EX  EX  WB
  Int. instruction      IF  ID  EX  MEM WB
  FP instruction            IF  ID  EX  EX  EX  WB
  Int. instruction          IF  ID  EX  MEM WB

  Type              Pipe Stages
  FP instruction    IF  ID  EX  EX  EX  WB
  Int. instruction  IF  ID  EX  MEM WB
  Int. instruction      IF  ID  EX  MEM WB
  FP instruction        IF  ID  EX  EX  EX  WB
  FP instruction            IF  ID  EX  EX  EX  WB
  Int. instruction          IF  ID  EX  MEM WB

12
Issue Process of 2-Issue MIPS
  • Fetch two instructions from the Prefetch unit or
    from the cache
  • Determine how many instructions can be issued: 0,
    1 or 2
  • Issue them to correct functional units
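
The issue decision above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual MIPS issue logic: the `Instr` record, the `issue_count` function, and the `busy_dests` set (destination registers of instructions still in execution) are hypothetical names, and the machine is assumed to have exactly one FP slot and one slot for everything else.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Instr:
    kind: str                      # "FP" for an FP operation, "INT" for anything else (assumed classes)
    dest: Optional[str] = None     # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read


def issue_count(packet: Sequence[Instr], busy_dests: Set[str]) -> int:
    """Return how many instructions of a 2-instruction packet can issue: 0, 1 or 2."""
    issued = []
    for ins in packet[:2]:
        # Hazards against instructions already in execution (RAW or WAW).
        if ins.dest in busy_dests or any(s in busy_dests for s in ins.srcs):
            break  # in-order issue: a stalled instruction blocks the one behind it
        # Hazards against the earlier instruction of the same packet.
        conflict = False
        for prev in issued:
            same_slot = prev.kind == ins.kind  # structural: one slot per class
            raw = prev.dest is not None and prev.dest in ins.srcs
            waw = prev.dest is not None and prev.dest == ins.dest
            if same_slot or raw or waw:
                conflict = True
        if conflict:
            break
        issued.append(ins)
    return len(issued)
```

For example, a load paired with an independent FP add issues as a pair, but the same pair issues only the load if the FP add reads a register that is still being produced.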

13
Fetching Two Instructions from I-Cache
  • Easy to fetch I1 and I2
  • How about I2 and I3?
  • Most processors issue only I2
  • Use a prefetch unit

14
2-Issue MIPS Hazard Checking
  • Potential Issue Packets
  • <>, <INT>, <FP>, <INT, FP>, <FP, INT>
  • Most hazard possibilities are eliminated within
    an Issue Packet
  • FP load/store/move paired with an FP op: FP register
    port contention
  • RAW hazard
  • WAR, WAW hazards across issue packet boundaries

15
Additional Hardware for Superscalars
  • Enhanced hazard detection
  • Minimal hardware support to execute integer and
    floating-point instructions in parallel
  • Different set of FP registers
  • Different set of Int registers
  • One additional FP read/write port
  • A larger set of bypass paths

16
Maintaining Precise Exception
Issue packet
  • Let the FP pipeline drain
  • DIV.D causes an exception after SUB.D exception
  • No precise exception at the HW level
  • Why? ADD.D destroys one of its operands
  • Approaches
  • Ignore the problem and settle for imprecise
    exceptions
  • Buffer the results of an operation until all the
    operations that were issued earlier are complete
  • Let the trap-handling routine create a precise
    sequence for the exception
  • Allow the instruction issue to continue only if
    all the instructions before this instruction will
    complete w/o causing an exception

17
1. Settle for Imprecise Exceptions
  • Virtual memory and the IEEE FP standard → require
    precise exceptions
  • Two modes of execution
  • Imprecise mode (fast)
  • Precise mode
  • Selected by a mode switch or by insertion of explicit
    FP-exception test instructions
  • The amount of overlap and reordering is
    significantly restricted
  • DEC Alpha 21064 and 21164, IBM Power I and II, MIPS
    R8000

18
2. Buffering the Results of an Operation
  • The difference in running times is large
  • The number of results to buffer becomes large
  • The results from the queue must be bypassed to
    all issuing and executing instructions
  • Large number of comparators and a very large
    multiplexor

19
2. Buffering Variations
  • History File (CYBER 180/990, VAX)
  • Keeps track of the original values of registers
  • Upon an exception, roll back and load the
    original values from the file
  • Future File
  • Keeps the newer value of register
  • Update the main register file from the future
    file after all earlier instructions complete
  • On an exception main reg file is intact!
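
The two buffering variations can be contrasted with a small Python sketch. It is a simplified model, not a description of any of the machines named above: the class names `HistoryFileRegs` and `FutureFileRegs` and their methods are hypothetical, and writes are assumed to be recorded in program order (real hardware completes them out of order).

```python
class HistoryFileRegs:
    """Registers are updated immediately; old values are logged for rollback."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.history = []  # (register, original value), oldest first

    def write(self, reg, value):
        self.history.append((reg, self.regs[reg]))  # save the original value
        self.regs[reg] = value

    def retire_oldest(self):
        self.history.pop(0)  # oldest instruction is safe; its log entry is dropped

    def exception(self):
        # Roll back in reverse order so the oldest saved value wins.
        while self.history:
            reg, old = self.history.pop()
            self.regs[reg] = old


class FutureFileRegs:
    """New values go to a future file; the main file is updated at retirement."""

    def __init__(self, regs):
        self.arch = dict(regs)    # main (precise) register file
        self.future = dict(regs)  # newest values, used by executing instructions
        self.pending = []         # (register, value), oldest first

    def write(self, reg, value):
        self.future[reg] = value
        self.pending.append((reg, value))

    def retire_oldest(self):
        reg, value = self.pending.pop(0)
        self.arch[reg] = value

    def exception(self):
        # The main register file is already precise; discard the future state.
        self.pending.clear()
        self.future = dict(self.arch)
```

The design difference shows up in `exception`: the history file has to undo work, while the future file only has to throw speculative values away.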

20
3. Trap-handling routine to create a precise
sequence of exceptions
  • Know what operations in the pipeline and their
    PCs
  • The software finishes any instructions that
    precede the latest instruction completed
  • I1: long-running, causes an exception
  • I2 … In-1: not completed
  • In: completed
  • SW simulates I1 … In-1 → major difficulty
  • HW restarts at In+1

21
4. All Instructions Before the Issuing Complete
w/o Exception
  • Stall the CPU to maintain precise exceptions
  • FP-functional units must determine if an
    exception is possible early in EX stage
  • In the first 3 clock cycles in MIPS pipeline
  • MIPS R2000/3000/4000, Intel Pentium

22
Precise Exception Handling in SS MIPS
  • Int. op finishes before FP op
  • Integer instruction completes before FP op
    exception detection
  • Imprecise exception
  • Solutions
  • Detecting FP exceptions early
  • Using software mechanisms to restore a precise
    exception state before resuming execution
  • Delaying instruction completion until an
    exception is impossible

Issue packet
The speculation approach uses the third solution
23
Load and Branch Stalls in SS MIPS
  • Load result is not available
  • on the same cycle
  • on the next cycle
  • Branch delay for taken branch
  • 2 instructions if the branch is the first in the
    packet
  • 3 instructions if the branch is the second in the
    packet

Total 3 instructions
24
Multiple Issue Challenges
  • While Integer/FP split is simple for the HW, get
    CPI of 0.5 only for programs with
  • Exactly 50% FP operations AND no hazards
  • If more instructions issue at the same time, greater
    difficulty of decode and issue
  • Even 2-scalar => examine 2 opcodes, 6 register
    specifiers, decide if 1 or 2 instructions can
    issue (N-issue: O(N^2 - N) comparisons)
  • Register file: needs 2x reads and 1x writes/cycle
  • Rename logic: must be able to rename the same
    register multiple times in one cycle! For
    instance, consider 4-way issue
  • add r1, r2, r3   →  add p11, p4, p7
    sub r4, r1, r2   →  sub p22, p11, p4
    lw  r1, 4(r4)    →  lw  p23, 4(p22)
    add r5, r1, r2   →  add p12, p23, p4
  • Imagine doing this transformation in a single
    cycle!
  • Result buses: Need to complete multiple
    instructions/cycle
  • So, need multiple buses with associated matching
    logic at every reservation station.
  • Or, need multiple forwarding paths
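
The 4-way renaming example above can be reproduced with a short Python sketch. It is a sequential software model of what the hardware must do within one cycle; the function name `rename_group`, the tuple encoding of instructions, and the free-list order are assumptions chosen so the output matches the slide (the `lw` offset is omitted from the encoding).

```python
def rename_group(group, rename_map, free_regs):
    """Rename a group of (op, dest, srcs) instructions issued together.

    Sources are looked up before the destination is remapped, so a later
    instruction in the group sees the new name of an earlier destination.
    """
    table = dict(rename_map)  # architectural -> physical register
    free = list(free_regs)    # physical registers available for allocation
    renamed = []
    for op, dest, srcs in group:
        phys_srcs = tuple(table[s] for s in srcs)
        phys_dest = free.pop(0)
        table[dest] = phys_dest  # r1 may be remapped more than once per group
        renamed.append((op, phys_dest, phys_srcs))
    return renamed, table


group = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r2")),
    ("lw",  "r1", ("r4",)),
    ("add", "r5", ("r1", "r2")),
]
renamed, table = rename_group(
    group,
    rename_map={"r2": "p4", "r3": "p7"},
    free_regs=["p11", "p22", "p23", "p12"],
)
for ins in renamed:
    print(ins)
```

Note that `r1` is renamed twice (to `p11`, then `p23`), and the final `add` must pick up `p23`. In hardware, that dependence on earlier renames in the same group is what makes single-cycle renaming hard.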

25
Multiple Instruction Issue with Dynamic Scheduling
  • Issue an instruction in half of a cycle
  • Two instructions are processed in one cycle
  • Build the necessary logic to handle two instructions
    at once
  • Any possible dependences between two instructions
  • Both approaches are used at the same time
  • Pipeline and widen the issue logic
  • Integrate dynamic branch prediction into a
    dynamically scheduled pipeline

26
A two-issue dynamic scheduled processor
  • Issue any pair of instructions if reservation
    station is available
  • Extend Tomasulo's scheme to deal with both
    integer and FP functional units and registers
  • Issue and write result take 1 cycle each
  • There is dynamic branch-prediction hardware,
    a branch condition evaluation unit, 1 int. ALU,
    pipelined FP units
  • LOOP: L.D    F0, 0(R1)     ; F0 = array element
  •       ADD.D  F4, F0, F2    ; add scalar in F2
  •       S.D    F4, 0(R1)     ; store result
  •       DADDIU R1, R1, -8    ; decrement pointer
  •                            ; 8 bytes (per DW)
  •       BNE    R1, R2, LOOP  ; branch R1 != R2

27
Latencies
  • Producer   Consumer   Cycles
  • ALU op     ALU op     1
  • Load       FP op      2
  • Load       ALU op     2
  • FP Add     FP Add     3
  • Branch prediction is perfect
  • Two CDBs
  • No delayed branch

28
First 3 iterations
IPC = 5/3 = 1.67; Execution rate = 15/16 = 0.94
29
Resource Usage
Do we need second CDB?
30
2-issue w/ additional resources
  • extra adder for effective address calculation

IPC = 5/3 = 1.67; Execution rate = 15/12 = 1.25
31
Resource Usage
Lower efficiency as measured by the utilization
of the functional unit
32
What limits the performance of 2-issue
dynamically scheduled pipeline?
  • Imbalance between the functional unit structure
    of the pipeline and the example loop
  • Impossible to fully use the FP units
  • Need fewer dependent integer operations/loop
  • Very high loop overhead (2/5)
  • Try to reduce this overhead in the next chapter
  • The control hazard: cannot start the next L.D
    before we know the outcome of the branch

33
Hardware-based Speculation
  • Every cycle execute a branch
  • Prediction is not sufficient to have high amount
    of ILP
  • Overcome control dependence by speculating on the
    outcome of branches
  • Execute the program as if our guesses were
    correct
  • Dynamic scheduling: Fetch, Issue (No execute)
  • Speculation: Fetch, Issue, Execute
  • Incorrect speculation → Undo

34
Hardware-based Speculation: Key Ideas
  • Dynamic branch prediction to choose which
    instructions to execute
  • Speculation to allow the execution of
    instructions before the control dependence is
    resolved
  • Dynamic scheduling to deal with the scheduling of
    different combinations of basic blocks
  • PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel
    Pentium II/III/4, Alpha 21264, AMD K5/K6/Athlon

35
Speculative Tomasulo's Algorithm
  • Separate bypassing of speculative and
    non-speculative results
  • Undo possible
  • Once an instruction is no longer speculative, it
    updates the register file or memory
  • Instruction Commit Stage
  • Key idea: out-of-order execution, in-order commit
  • Use Reorder buffer (ROB) for in-order commit

36
Reorder Buffer (ROB)
  • Holds the results of instructions that executed
    but not committed
  • Passes results among instructions that may be
    speculated
  • Like Store buffer in Tomasulo

(Figure: the ROB is the source of operands between the time
execution completes and the time the instruction commits)
37
Reorder Buffer Structure
38
Tomasulo With Reorder buffer
(Figure: FP Op Queue, Reorder Buffer with entries ROB1 (oldest)
through ROB7 (newest), Registers, Reservation Stations, FP adders,
FP multipliers, and paths to and from memory. The buffer holds one
entry: LD F0,10(R2), Dest F0, Done? N.)
Prof. John Kubiatowicz's slide
39
Four-steps in Speculative Tomasulo
  • Issue (dispatch)
  • Get an instruction from the queue; issue it if
    there is an empty reservation station and ROB slot,
    otherwise stall. Send operands to the reservation
    station if they are available in the ROB or the
    registers. Send the ROB entry number to the
    reservation station; later, the RS puts the result
    and this tag on the CDB.
  • Execute (issue)
  • Wait for operands that are not ready by watching the
    CDB, i.e. check for RAW hazards
  • Loads take 2 steps: check if at the head of the load
    buffer, then read from memory.
  • Stores: effective address calculation.
  • Write Result
  • Put the result with its ROB tag on the CDB; all waiting
    reservation stations and the ROB read from the CDB
  • STORE: if the value is available, write it to the ROB
    slot; if not, watch the CDB to update the value field
    of the ROB slot
  • Commit (completion, graduation)
  • BRANCH w/ incorrect prediction: when it reaches the
    head of the ROB, flush the ROB and restart execution
    at the correct successor of the branch
  • STORE: store reaches the head of the ROB and its
    result is available → normal commit: write to
    memory
  • Any other instruction: instruction reaches the
    head of the ROB and its result is available → normal
    commit: write to a register
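
The commit step can be sketched as follows. This is a minimal Python model under assumptions of my own, not the hardware's actual structure: `RobEntry`, `commit_step`, and the returned status strings are hypothetical, and the ROB is modelled as a `collections.deque` whose left end is the head (oldest entry).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    kind: str                  # "alu", "store" or "branch" (assumed categories)
    dest: object = None        # register name or memory address
    value: object = None       # result, once written by the Write Result step
    ready: bool = False        # True when the result is available
    mispredicted: bool = False


def commit_step(rob, regs, mem):
    """Try to commit the entry at the head of the ROB (in-order commit)."""
    if not rob:
        return "empty"
    head = rob[0]
    if not head.ready:
        return "wait"          # younger finished entries must wait behind the head
    rob.popleft()
    if head.kind == "branch":
        if head.mispredicted:
            rob.clear()        # discard every speculative entry after the branch
            return "flush"
    elif head.kind == "store":
        mem[head.dest] = head.value   # memory is updated only at commit
    else:
        regs[head.dest] = head.value  # register file is updated only at commit
    return "commit"
```

Because registers and memory change only inside `commit_step`, a flush needs no undo: the speculative results never left the ROB.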

40
Speculative Example
  • L.D F6, 34(R2)
  • L.D F2, 45(R3)
  • MUL.D F0, F2, F4
  • SUB.D F8, F6, F2
  • DIV.D F10, F0, F6
  • ADD.D F6, F8, F2

Latencies: Add 2, Mult 10, Divide 40
41
Speculative Example
Reorder buffer and FP register status (figure)
Reorder # by register: F0 = 3, F6 = 6, F8 = 4, F10 = 5
42
Speculative Example
Reservation Stations
43
Speculative Loop Example
LOOP: L.D    F0, 0(R1)     ; F0 = array element
      ADD.D  F4, F0, F2    ; add scalar in F2
      S.D    F4, 0(R1)     ; store result
      DADDIU R1, R1, -8    ; decrement pointer 8 bytes (per DW)
      BNE    R1, R2, LOOP  ; branch R1 != R2
44
Speculative Loop Example
Reorder buffer and FP register status (figure)
Reorder # by register: F0 = 6, F4 = 7
45
Speculative Dynamic Scheduling Summary
  • Record speculative exception in the ROB
  • Check for exception when instruction is ready to
    commit
  • More complicated control than non-speculative Tomasulo
  • A store updates memory when it reaches
  • the Write Result stage in non-speculative Tomasulo
  • the head of the ROB in speculative Tomasulo
  • A store waits in the Write Result stage for its source
    operand
  • Move the value from the store's reservation station to
    the store's ROB entry
  • In reality, the sourcing instruction puts the value
    directly into the store's ROB entry by searching for
    waiting stores in the ROB
  • WAW and WAR memory hazards are eliminated
  • Actual memory update occurs in order
  • RAW memory hazards
  • The effective address of a load is computed in
    program order w.r.t. all earlier stores
  • A load cannot initiate reading from memory (step 2)
    if any active ROB entry occupied by a store has a
    Destination field that matches the value of the
    Address field of the load

46
Load/Store RAW Hazard
  • Question: Given a load that follows a store in
    program order, are the two related?
  • (Alternatively: is there a RAW hazard between the
    store and the load?) E.g.:
    st 0(R2),R5
    ld R6,0(R3)
  • Can we go ahead and start the load early?
  • Store address could be delayed for a long time by
    some calculation that leads to R2 (divide?).
  • We might want to issue/begin execution of both
    operations in same cycle.
  • Answer is that we are not allowed to start the load
    until we know that address 0(R2) ≠ 0(R3)

47
Hardware Support for Memory Disambiguation
  • Need buffer to keep track of all outstanding
    stores to memory, in program order.
  • Keep track of address (when becomes available)
    and value (when becomes available)
  • FIFO ordering will retire stores from this
    buffer in program order
  • When issuing a load, record current head of store
    queue (know which stores are ahead of you).
  • When have address for load, check store queue
  • If any store prior to load is waiting for its
    address, stall load.
  • If load address matches earlier store address
    (associative lookup), then we have a
    memory-induced RAW hazard
  • store value available → return value
  • store value not available → return ROB number of
    source
  • Otherwise, send out request to memory
  • Actual stores commit in order, so no worry about
    WAR/WAW hazards through memory.
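
The load check against the store queue can be sketched in Python. The names `StoreEntry`, `load_lookup`, `source_rob`, and `n_older` are hypothetical, and choosing the youngest matching earlier store is an assumption of this sketch (it is the store whose value the load would observe in program order).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    source_rob: int              # ROB number of the instruction producing the value
    addr: Optional[int] = None   # None until the effective address is known
    value: Optional[int] = None  # None until the store data is known


def load_lookup(store_queue: List[StoreEntry], n_older: int,
                load_addr: int) -> Tuple[str, Optional[int]]:
    """Decide what a load does once its address is known.

    store_queue is in program order (oldest first); n_older is the number
    of stores that were ahead of the load when it issued.
    """
    older = store_queue[:n_older]
    if any(s.addr is None for s in older):
        return ("stall", None)                  # an earlier store address is unknown
    for s in reversed(older):                   # youngest earlier store first
        if s.addr == load_addr:                 # memory-induced RAW hazard
            if s.value is not None:
                return ("forward", s.value)     # store value available
            return ("wait_for_rob", s.source_rob)
    return ("memory", None)                     # no conflict: send request to memory
```

A load that issued behind two stores passes `n_older=2`; stores that enter the queue after the load are ignored by the slice.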

48
Multiple Issue w/ Speculation
  • Assign multiple reservation stations and reorder
    buffers to the instructions
  • Challenges
  • Instruction issue and monitoring the CDBs for
    instruction completion
  • Handle multiple instruction commits/cycle

49
Non-speculative vs. Speculative
  • Loop: LD     R2, 0(R1)
  •       DADDIU R2, R2, 1
  •       SD     R2, 0(R1)
  •       DADDIU R1, R1, 4
  •       BNE    R2, R3, Loop
  • Separate units for effective address calculation,
    for ALU operations, for branch condition
    evaluation
  • Up to 2 instructions of any type can commit per
    clock
  • The branch is a key performance limitation

50
Design Considerations for Speculative Machines
  • Register renaming vs. Reorder buffers
  • A large set of registers (architectural vs.
    physical registers)
  • How much to speculate
  • Handle only low-cost exceptional events in
    speculative mode
  • 1st-level cache miss vs. 2nd-level miss
  • Speculating through Multiple Branches
  • Very high branch frequency, significant
    clustering of branches, long delays in FUs