Lecture 6: ILP Techniques - PowerPoint PPT Presentation

About This Presentation
Title:

Lecture 6: ILP Techniques

Description:

... sequentially by an IFU that operates independently from datapath control. ... Op Operation to perform in the unit (e.g., or ) Fi Destination register ... – PowerPoint PPT presentation

Number of Views:33
Avg rating:3.0/5.0
Slides: 24
Provided by: david2988
Learn more at: http://www.cs.ucr.edu
Category:

less

Transcript and Presenter's Notes

Title: Lecture 6: ILP Techniques


1
Lecture 6 ILP Techniques
  • Laxmi N. Bhuyan
  • CS 162
  • Spring 2003

2
HW Schemes Instruction Parallelism
  • Why in HW at run time?
  • Works when cant know real dependence at compile
    time
  • Compiler simpler
  • Code for one machine runs well on another
  • Key idea Allow instructions behind stall to
    proceed
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F12,F8,F14
  • Enables out-of-order execution gt out-of-order
    completion
  • ID stage checks for hazards. If no hazards, issue
    the instn for execution. Scoreboard dates to CDC
    6600 in 1963

3
How ILP Works
  • Issuing multiple instructions per cycle would
    require fetching multiple instructions from
    memory per cycle gt called Superscalar degree or
    Issue width
  • To find independent instructions, we must have a
    big pool of instructions to choose from, called
    instruction buffer (IB). As IB length increases,
    complexity of decoder (control) increases that
    increases the datapath cycle time
  • Prefetch instructions sequentially by an IFU that
    operates independently from datapath control.
    Fetch instruction (PC)L, where L is the IB size
    or as directed by the branch predictor. (See
    Fig. 6-1 Pentium diagram)

4
Pentium Datapath
  • Pentium consists of two pipes (U-pipe and V-pipe)
    operating in parallel. U-pipe contains an 8-stage
    FP pipeline (see Pentium Figure)
  • Two stages of Decode Decode and control one
    stage Register read 2nd stage
  • See I-cache and D-cache in Fig. 6-1. What is TLB?
    How does the Virtual memory work?

5
HW Schemes Instruction Parallelism
  • Two types Scoreboard and Tomasulo
  • Scoreboard (EX PENTIUM)
  • Out-of-order execution divides ID stage
  • 1. Issuedecode instructions, check for
    structural hazards
  • 2. Read operandswait until no data hazards, then
    read operands
  • Scoreboards allow instruction to execute whenever
    there is no structural hazard or not waiting for
    prior instructions. So the instructions are
    issued in order, but can bypass the waiting
    instructions in the read operand stage gt
    In-order issue Out-of-Order execution gt
    Out-of-Order completion
  • Named after CDC 6600 Scoreboard, which developed
    this capability

6
Scoreboard Implications
  • Scoreboard replaces ID, EX, WB with 4 stages
  • Out-of-order completion gt WAR, WAW hazards?
  • Solutions for WAR gt Wait at the WB stage until
    the other instruction completes
  • For WAW, must detect hazard at the ID stage
    stall until other completes
  • Need to have multiple instructions in execution
    phase gt multiple execution units or pipelined
    execution units
  • Scoreboard keeps track of dependencies, state or
    operations

7
Four Stages of Scoreboard Control
  • 1. Issuedecode instructions check for
    structural hazards (ID1)
  • If a functional unit for the instruction is
    free and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure. If a
    structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.
  • 2. Read operandswait until no data hazards, then
    read operands (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it,
    or if the register containing the operand is
    being written by a currently active functional
    unit. If the source operands are available for an
    instn, the scoreboard tells the functional unit
    to proceed to read the operands from the
    registers and begin execution. The scoreboard
    resolves RAW hazards dynamically in this step,
    and instructions may be sent into execution out
    of order.

8
Four Stages of Scoreboard Control
  • 3. Executionoperate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write resultfinish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If none, it
    writes results. If WAR, then it stalls the
    instruction.
  • Example
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F8,F8,F14
  • CDC 6600 scoreboard would stall SUBD until ADDD
    reads operands

9
Design of the Scoreboard
  • 1. Instruction statuswhich of 4 steps the
    instruction is in
  • 2. Functional unit statusIndicates the state of
    the functional unit (FU). 9 fields for each
    functional unit
  • BusyIndicates whether the unit is busy or not
  • OpOperation to perform in the unit (e.g., or
    )
  • FiDestination register
  • Fj, FkSource-register numbers
  • Qj, QkFunctional units producing source
    registers Fj, Fk
  • Rj, RkFlags indicating when Fj, Fk are ready
  • 3. Register result statusIndicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions will
    write that register

10
Detailed Scoreboard Pipeline Control
11
CDC 6600 Scoreboard
  • Speedup 1.7 from compiler 2.5 by hand BUT slow
    memory (no cache) limits benefit
  • Limitations of 6600 scoreboard
  • No forwarding hardware
  • Limited to instructions in basic block (small
    window)
  • Small number of functional units (structural
    hazards), especailly integer/load store units
  • Do not issue on structural hazards
  • Wait for WAR hazards
  • Prevent WAW hazards

12
Summary
  • Instruction Level Parallelism (ILP) in SW or HW
  • Loop level parallelism is easiest to see
  • SW parallelism dependencies defined for program,
    hazards if HW cannot resolve
  • SW dependencies/compiler sophistication determine
    if compiler can unroll loops
  • Memory dependencies hardest to determine
  • HW exploiting ILP
  • Works when cant know dependence at run time
  • Code for one machine runs well on another
  • Key idea of Scoreboard Allow instructions behind
    stall to proceed (Decode gt Issue instr read
    operands)
  • Enables out-of-order execution gt out-of-order
    completion
  • ID stage checked both for structural

13
Tomasulo Algorithm(Implemented in IBM 360/91 in
1966)
  • Control buffers distributed with Function Units
    (FU) vs. centralized in scoreboard
  • FU buffers called reservation stations have
    pending operands
  • Registers in instructions replaced by values or
    pointers to reservation stations(RS) called
    register renaming
  • avoids WAR, WAW hazards
  • More reservation stations than registers, so can
    do optimizations compilers cant
  • Results to FU from RS, not through registers,
    over Common Data Bus that broadcasts results to
    all FUs
  • Load and Stores treated as FUs with RSs as well
  • Integer instructions can go past branches,
    allowing FP ops beyond basic block in FP queue

14
Tomasulo Organization
FPRegisters
FP Op Queue
LoadBuffer
StoreBuffer
CommonDataBus
FP AddRes.Station
FP MulRes.Station
15
Reservation Station Components
  • OpOperation to perform in the unit (e.g., or
    )
  • Vj, VkValue of Source operands
  • Store buffers has V field, result to be stored
  • Qj, QkReservation stations producing source
    registers (value to be written)
  • Note No ready flags as in Scoreboard Qj,Qk0 gt
    ready
  • Store buffers only have Qi for RS producing
    result
  • BusyIndicates reservation station or FU is
    busy
  • Register result statusIndicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions that
    will write that register.

16
Three Stages of Tomasulo Algorithm
  • 1. Issueget instruction from FP Op Queue
  • If reservation station free (no structural
    hazard), control issues instr sends operands
    (renames registers).
  • 2. Executionoperate on operands (EX)
  • When both operands ready then execute if not
    ready, watch Common Data Bus for result
  • 3. Write resultfinish execution (WB)
  • Write on Common Data Bus to all awaiting units
    mark reservation station available
  • Normal data bus data destination (go
    to bus)
  • Common data bus data source (come from bus)
  • 64 bits of data 4 bits of Functional Unit
    source address
  • Write if matches expected Functional Unit
    (produces result)
  • Does the broadcast

17
Tomasulo v. Scoreboard(IBM 360/91 v. CDC 6600)
  • Pipelined Functional Units Multiple Functional
    Units
  • (6 load, 3 store, 3 , 2 x/) (1 load/store, 1
    , 2 x, 1 )
  • window size 14 instructions 5 instructions
  • No issue on structural hazard same
  • WAR renaming avoids stall completion
  • WAW renaming avoids stall completion
  • Broadcast results from FU Write/read registers
  • distributed reservation stations central
    scoreboard

18
Tomasulo Drawbacks
  • Complexity
  • delays of 360/91, MIPS 10000, IBM 620?
  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
  • Multiple CDBs gt more FU logic for parallel assoc
    stores

19
Tomasulo Summary
  • Reservations stations renaming to larger set of
    registers buffering source operands
  • Prevents registers as bottleneck
  • Avoids WAR, WAW hazards of Scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks (integer units gets
    ahead, beyond branches)
  • Helps cache misses as well
  • Lasting Contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • 360/91 descendants are Pentium II PowerPC 604
    MIPS R10000 HP-PA 8000 Alpha 21264

20
HW support for More ILP
HW support for More ILP
  • Speculation allow an instruction to issue that
    is dependent on branch predicted to be taken
    without any consequences (including exceptions)
    if branch is not actually taken (HW undo)
    called boosting
  • Combine branch prediction with dynamic scheduling
    to execute before branches resolved
  • Separate speculative bypassing of results from
    real bypassing of results
  • When instruction no longer speculative, write
    boosted results (instruction commit)or discard
    boosted results
  • execute out-of-order but commit in-order to
    prevent irrevocable action (update state or
    exception) until instruction commits

21
HW support for More ILP
  • Need HW buffer for results of uncommitted
    instructions reorder buffer
  • 3 fields instr, destination, value
  • Reorder buffer can be operand source gt more
    registers like RS
  • Use reorder buffer number instead of reservation
    station when execution completes
  • Supplies operands between execution complete
    commit
  • Once operand commits, result is put into
    register
  • Instructions commit in order
  • As a result, its easy to undo speculated
    instructions on mispredicted branches or on
    exceptions

Reorder Buffer
FP Op Queue
FP Regs
Res Stations
Res Stations
FP Adder
FP Adder
22
Four Steps of Speculative Tomasulo Algorithm
  • 1. Issueget instruction from FP Op Queue
  • If reservation station and reorder buffer slot
    free, issue instr send operands reorder
    buffer no. for destination (this stage sometimes
    called dispatch)
  • 2. Executionoperate on operands (EX)
  • When both operands ready then execute if not
    ready, watch CDB for result when both in
    reservation station, execute checks RAW
    (sometimes called issue)
  • 3. Write resultfinish execution (WB)
  • Write on Common Data Bus to all awaiting FUs
    reorder buffer mark reservation station
    available.
  • 4. Commitupdate register with reorder result
  • When instr. at head of reorder buffer result
    present, update register with result (or store to
    memory) and remove instr from reorder buffer.
    Mispredicted branch flushes reorder buffer
    (sometimes called graduation)

23
Renaming Registers
  • Common variation of speculative design
  • Reorder buffer keeps instruction information but
    not the result
  • Extend register file with extra renaming
    registers to hold speculative results
  • Rename register allocated at issue result into
    rename register on execution complete rename
    register into real register on commit
  • Operands read either from register file (real or
    speculative) or via Common Data Bus
  • Advantage operands are always from single source
    (extended register file)
Write a Comment
User Comments (0)
About PowerShow.com