Superscalar Design for Large Instruction Windows: the ROB and LSQ

1
Superscalar Design for Large Instruction Windows
the ROB and LSQ
  • Sam Stone Kevin Michael Woley
  • ECE512 Final Project

2
Reorder Buffer
3
Classical ROB Uses
  • Recovery From Misspeculation: Branch, Memory
  • State Retirement: Register, Memory
  • Precise Interrupts and Exceptions
  • Resource Management:
  • Storage for renamed registers, in the case where
    the ROB doubles as the physical register file
  • Mechanism whereby physical registers are
    reclaimed when the physical register file is
    separate from the ROB

4
Debunking the ROB
  • Does ROB size limit large-instruction-window,
    high-ILP processors?
  • Large FIFOs are not necessarily hard to build
    and are not the limiting factor
  • Why are we not building larger ROBs today?
  • The real performance limiter is not the ROB
    itself but the mechanisms that use the ROB
  • Misspeculation recovery, state retirement,
    precise interrupts, and physical resource
    reclamation all use the ROB in a serialized
    manner
  • Performance bound by rate of retirement, not ROB
    size
  • Key design goal for large instruction window
    machines:
  • Break the unneeded serial dependencies on the ROB,
    and serialize those which remain at
    granularities greater than a single instruction
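
The retirement bound can be illustrated with a toy model. This is a minimal sketch, not taken from the slides: it assumes the ROB is a FIFO that retires strictly in order, at most `retire_width` completed entries per cycle. The names `retire_cycle`, `cycles_to_drain`, `retire_width`, and the `done` field are hypothetical.

```python
from collections import deque

def retire_cycle(rob, retire_width):
    """Retire up to retire_width completed instructions from the ROB head.

    rob is a deque of dicts with a boolean 'done' field (hypothetical layout).
    Retirement is in order: it stops at the first entry that is not done.
    Returns the list of entries retired this cycle.
    """
    retired = []
    while rob and len(retired) < retire_width and rob[0]["done"]:
        retired.append(rob.popleft())
    return retired

def cycles_to_drain(num_instructions, retire_width):
    """Cycles needed to retire a ROB full of already completed instructions.

    Assumes retire_width >= 1.
    """
    rob = deque({"id": i, "done": True} for i in range(num_instructions))
    cycles = 0
    while rob:
        retire_cycle(rob, retire_width)
        cycles += 1
    return cycles
```

In this model the drain time depends only on the retirement width, and a single incomplete instruction at the head blocks every completed instruction behind it, no matter how many entries the ROB holds.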

5
Extending the ROB
  • Classical ROB Extensions
  • Retire more than one instruction per cycle.
    Addresses the ability to retire multiple
    instructions per cycle by allowing sequential
    instructions to retire concurrently.
  • Speculative Early Retirement of Instructions
  • Addresses sequential retirement of instructions,
    early resource reclamation and, to a second
    order, the ability to retire multiple
    instructions per cycle.
  • Misspeculation recovery sequentialized by a
    History Buffer or similar structure

6
Checkpoint Repair
  • Partial Checkpointing of State
  • Addresses non-sequential recovery from
    misspeculation as well as early resource
    reclamation.
  • MIPS R10000, Alpha 21264
  • Example: Cherry
  • Point of No Return (PNR): oldest instruction
    that can suffer a branch or memory misprediction

[Figure: ROB diagram with labels HEAD, PNR, TAIL, REVERSIBLE, IRREVERSIBLE]
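
The PNR definition above can be sketched as a scan from the ROB head. This is a minimal illustration under stated assumptions: the ROB is a list ordered oldest to youngest, each entry carries a hypothetical `can_misspeculate` flag, and entries older than the PNR are treated as the irreversible ones. The function names are invented for this sketch.

```python
def find_pnr(rob):
    """Return the index of the oldest entry that can still misspeculate.

    rob is ordered from oldest (head) to youngest (tail). Each entry is a
    dict with a hypothetical boolean 'can_misspeculate' field. If no entry
    can misspeculate, the PNR sits at the tail, i.e. len(rob).
    """
    for index, entry in enumerate(rob):
        if entry["can_misspeculate"]:
            return index
    return len(rob)

def split_at_pnr(rob):
    """Split the ROB into (irreversible, reversible) parts at the PNR.

    Entries older than the PNR can no longer be squashed, so under the
    assumption above their resources are candidates for early reclamation.
    """
    pnr = find_pnr(rob)
    return rob[:pnr], rob[pnr:]
```

Note that an entry younger than the PNR stays in the reversible part even if it cannot misspeculate itself, because an older instruction may still squash it.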
7
Full Checkpoint Repair
  • Checkpointed Processors
  • Addresses sequential instruction retirement,
    early resource reclamation, the ability to retire
    multiple instructions per cycle, and
    non-sequential recovery from misspeculation.
  • Problem: cannot checkpoint at every instruction

[Figure: instruction window with checkpoints CHKPT 1, CHKPT 2, CHKPT 3; labels ACTIVE, COMPLETE, COMPLETE NOT ASSOCIATED, OLDER INSTRUCTIONS]
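
Because a checkpoint cannot be taken at every instruction, recovery has to roll back to the nearest older checkpoint and redo some correct work. The sketch below illustrates that cost under assumptions that are not in the slides: checkpoints are identified by the sequence number of the instruction they precede, and `rollback_target` and `reexecuted_count` are hypothetical helper names.

```python
import bisect

def rollback_target(checkpoint_seqs, faulting_seq):
    """Pick the youngest checkpoint at or before the faulting instruction.

    checkpoint_seqs is a sorted list of sequence numbers at which a
    checkpoint was taken (hypothetical representation).
    """
    position = bisect.bisect_right(checkpoint_seqs, faulting_seq)
    if position == 0:
        raise ValueError("no checkpoint at or before the faulting instruction")
    return checkpoint_seqs[position - 1]

def reexecuted_count(checkpoint_seqs, faulting_seq):
    """Correctly executed instructions that must be redone after rollback."""
    return faulting_seq - rollback_target(checkpoint_seqs, faulting_seq)
```

With checkpoints at sequence numbers 0, 64, and 128, a misspeculation at instruction 100 rolls back to 64 and re-executes 36 instructions that were not at fault. Sparser checkpoints save storage but raise this penalty.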
8
Load-Store Queues
9
Conventional LSQ
  • Enables speculative, out-of-order execution of
    loads and stores
  • Memory disambiguation
  • Store-to-load forwarding
  • In-order retirement of stores
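
Store-to-load forwarding in a conventional LSQ amounts to a search of the store queue. The following is a minimal sketch, assuming a store queue held as a list ordered oldest to youngest, fully resolved addresses, and equal access sizes. The field names `seq`, `address`, and `value` are hypothetical.

```python
def forward_from_store_queue(store_queue, load_seq, load_address):
    """Return the value of the youngest store older than the load that
    writes load_address, or None if the load must read the cache.

    store_queue is ordered oldest to youngest; each entry is a dict with
    hypothetical 'seq', 'address', and 'value' fields.
    """
    for store in reversed(store_queue):
        if store["seq"] < load_seq and store["address"] == load_address:
            return store["value"]
    return None
```

Hardware performs this search associatively across all entries at once, which is the part that becomes expensive as the queue grows.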

10
Scaling the LSQ
  • Memory disambiguation and store-to-load
    forwarding do not scale as the instruction window
    grows:
  • High latency
  • High power consumption
  • Insufficient bandwidth
  • Search filtering decreases power consumption and
    increases bandwidth
  • Segmented or hierarchical LSQs decrease the
    latency of LSQ searches
  • Address-indexed LSQs decrease power and latency

11
Search Filtering (UT-Austin)
  • Bloom Filter Predictor (BFP)
  • Table of reference counters
  • Indexed by low-order bits of load/store address
  • Load/Store BFP tracks the number of in-flight
    loads/stores to each address
  • When a load issues
  • Increments ref counter in Load BFP
  • Reads ref counter in Store BFP
  • If SBFP ref counter is zero, does not search LSQ
  • When a load retires
  • Decrements ref counter in Load BFP
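
The load-side steps above can be sketched with two tables of reference counters. This is an illustrative model only: the class names, the `index_bits` parameter, and the unbounded Python integer counters (which never saturate) are assumptions, and the store-side methods are written by analogy with the load-side steps.

```python
class CounterTable:
    """Table of reference counters indexed by low-order address bits."""

    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.counters = [0] * (1 << index_bits)

    def increment(self, address):
        self.counters[address & self.mask] += 1

    def decrement(self, address):
        self.counters[address & self.mask] -= 1

    def count(self, address):
        return self.counters[address & self.mask]


class SearchFilter:
    """Decides whether an issuing load or store must search the LSQ."""

    def __init__(self, index_bits=10):
        self.load_bfp = CounterTable(index_bits)
        self.store_bfp = CounterTable(index_bits)

    def issue_load(self, address):
        """Count the load; return True if the LSQ must be searched."""
        self.load_bfp.increment(address)
        return self.store_bfp.count(address) != 0

    def retire_load(self, address):
        self.load_bfp.decrement(address)

    def issue_store(self, address):
        """Store-side analogue (assumed): check the Load BFP instead."""
        self.store_bfp.increment(address)
        return self.load_bfp.count(address) != 0

    def retire_store(self, address):
        self.store_bfp.decrement(address)
```

A zero counter proves that no in-flight access maps to that index, so the search can be skipped safely. A nonzero counter only says a search might be needed, because different addresses can share an index.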

12
Search Filtering (UT-Austin)
  • Analogous case for stores
  • Performance
  • Prevents 75% of memory instructions from searching the LSQ
  • Problems
  • Hash collisions in Load/Store BFPs
  • Reference counter saturation
  • Does not increase effective capacity of LSQ
  • Does not decrease latency of searching

13
A Segmented LSQ
  • LSQ is a circular queue of LSQ segments
  • Each segment is a conventional LSQ
  • Self-circular allocation policy
  • Search Policy: Bandwidth-Latency Tradeoff
  • Parallel search decreases latency (Intel)
  • Serial search increases bandwidth (Purdue)
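
The tradeoff between the two search policies can be made concrete with a toy cost model. This sketch assumes one cycle per segment access and reports how many segments a single search occupies. The function name, the `policy` strings, and the cost model itself are assumptions made for illustration, not details from the cited designs.

```python
def search_segments(segments, address, policy):
    """Search LSQ segments for address under a given policy.

    segments is a list of segments ordered from the first one searched to
    the last; each segment is a list of addresses. Returns a tuple
    (hit, cycles, segment_accesses) under a one-cycle-per-segment model.
    """
    if policy == "parallel":
        # Every segment is probed in the same cycle.
        hit = any(address in segment for segment in segments)
        return hit, 1, len(segments)
    if policy == "serial":
        # Segments are probed one per cycle until the first hit.
        for visited, segment in enumerate(segments, start=1):
            if address in segment:
                return True, visited, visited
        return False, len(segments), len(segments)
    raise ValueError("policy must be 'parallel' or 'serial'")
```

In this model a parallel search always finishes in one cycle but ties up every segment, while a serial search that hits early leaves the remaining segments free to serve other searches.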

14
A Hierarchical Store Queue
  • Small, fast L1 store queue
  • Large, slow L2 store queue
  • Membership test buffer
  • Reference counts addresses in L2SQ
  • Fast lookup to determine if address is in L2SQ
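
A minimal sketch of how the membership test buffer might gate the slow L2 search follows. It assumes the looked-up load is younger than every store in the queue, that the oldest L1 entry spills to L2 when L1 is full, and that the buffer is a table of counters indexed by low-order address bits. The class and attribute names are hypothetical.

```python
class HierarchicalStoreQueue:
    def __init__(self, l1_capacity, mtb_index_bits=8):
        self.l1_capacity = l1_capacity
        self.l1 = []  # youngest stores, ordered oldest to youngest
        self.l2 = []  # older stores displaced from L1
        self.mtb_mask = (1 << mtb_index_bits) - 1
        self.mtb = [0] * (1 << mtb_index_bits)  # reference counts for L2
        self.l2_searches = 0  # how often the slow queue was searched

    def insert(self, address, value):
        """Add a store; spill the oldest L1 entry to L2 when L1 is full."""
        self.l1.append((address, value))
        if len(self.l1) > self.l1_capacity:
            old_address, old_value = self.l1.pop(0)
            self.l2.append((old_address, old_value))
            self.mtb[old_address & self.mtb_mask] += 1

    def retire_oldest(self):
        """Remove the oldest store, keeping the reference counts in sync."""
        if self.l2:
            address, _ = self.l2.pop(0)
            self.mtb[address & self.mtb_mask] -= 1
        elif self.l1:
            self.l1.pop(0)

    def lookup(self, address):
        """Return the youngest matching store value, or None."""
        for store_address, value in reversed(self.l1):
            if store_address == address:
                return value
        if self.mtb[address & self.mtb_mask] == 0:
            return None  # address cannot be in L2, skip the slow search
        self.l2_searches += 1
        for store_address, value in reversed(self.l2):
            if store_address == address:
                return value
        return None
```

The `l2_searches` counter is only there to make the filtering visible: lookups that miss in L1 and find a zero reference count never touch the L2 queue.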

15
Want More??
16
Address-Indexed Structures
  • The LSQ tracks load-store dependences by renaming
    the memory space.
  • Bypass values and recovery checkpoints are
    generated via searches.
  • Power consumption and latency increase
    dramatically as the number of in-flight loads and
    stores increases.
  • Might there be a low-power, low-latency
    alternative to the FIFO organization of the LSQ?

17
Address-Indexed Structures
  • Address-Indexed Memory Disambiguation Tables
    (MDTs)
  • Loads and stores use low-order bits of their
    addresses to index into the MDTs.
  • The MDTs store the sequence numbers of the
    latest in-flight load and store to each address.
  • Memory ordering violations are detected by
    comparing the sequence number of the issued
    load/store to the sequence number in the
    corresponding MDT.
  • Recovery with the MDT is more conservative than
    recovery with the LSQ.
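
One plausible reading of the sequence-number comparison is sketched below. It is an illustration under assumptions, not the presenters' implementation: sequence numbers are non-negative and increase in program order, `-1` marks an empty entry, entries are never cleared at retirement, and the class and method names are hypothetical.

```python
class MemoryDisambiguationTable:
    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        size = 1 << index_bits
        self.latest_load = [-1] * size   # -1 means no load recorded
        self.latest_store = [-1] * size  # -1 means no store recorded

    def issue_load(self, seq, address):
        """Record an executing load; return True on an ordering violation.

        A violation is flagged when a younger store to the same index has
        already executed.
        """
        i = address & self.mask
        violation = self.latest_store[i] > seq
        self.latest_load[i] = max(self.latest_load[i], seq)
        return violation

    def issue_store(self, seq, address):
        """Record an executing store; return True on an ordering violation.

        A violation is flagged when a younger load or a younger store to
        the same index has already executed.
        """
        i = address & self.mask
        violation = self.latest_load[i] > seq or self.latest_store[i] > seq
        self.latest_store[i] = max(self.latest_store[i], seq)
        return violation
```

Because only low-order bits are compared, two different addresses that share an index can trigger a violation that a full address comparison would not report. That is one way in which recovery becomes more conservative than with an LSQ.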

18
Address-Indexed Structures
  • Address-Indexed Forwarding Cache
  • Forwarding cache holds the speculative values of
    in-flight memory addresses.
  • Each load accesses the forwarding cache as a
    level-0 cache in the cache-memory hierarchy.
  • The state of the forwarding cache may become
    corrupt (branch mispredictions, memory
    misspeculations). Policy for recovering from
    corrupt state is essential to high performance.
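
A minimal sketch of the forwarding cache idea follows. It makes several simplifying assumptions that are not in the slides: the cache has unbounded capacity and is keyed by full address, backing memory is a plain dict that reads as 0 where unwritten, and the recovery policy is to flush everything, which is only safe if every in-flight instruction is squashed and re-executed as well. All names are hypothetical.

```python
class ForwardingCache:
    def __init__(self):
        self.values = {}  # address -> speculative value of an in-flight store

    def store(self, address, value):
        """An executing store deposits its speculative value."""
        self.values[address] = value

    def load(self, address, memory):
        """A load checks the forwarding cache first, then backing memory.

        memory is a dict standing in for the rest of the hierarchy;
        addresses missing from it read as 0 (an assumption of this sketch).
        """
        if address in self.values:
            return self.values[address]
        return memory.get(address, 0)

    def flush(self):
        """Simplest recovery policy (assumed): discard all speculative state."""
        self.values.clear()
```

Unlike the LSQ search, a load here performs a single indexed access. The price is that values written down a wrong path stay visible until some recovery policy removes them, which is why the choice of policy matters so much for performance.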