Reducing Issue Logic Complexity in Superscalar Microprocessors

Transcript and Presenter's Notes
1
Reducing Issue Logic Complexity in Superscalar
Microprocessors
  • Survey Project
  • CprE 585 Advanced Computer Architecture
  • David Lastine
  • Ganesh Subramanian

2
Introduction
  • The ultimate goal of any computer architect is to design a fast machine
  • Approaches
  • Increasing clocking rate (Help from VLSI)
  • Increasing bus width
  • Increasing pipeline depth
  • Superscalar architectures
  • Tradeoffs between hardware complexity and clock speed
  • Given a particular technology, the more complex the hardware, the lower the achievable clock rate

3
A New Paradigm
  • Retaining the effective functionality of complex
    superscalar processors
  • Target the bottleneck in present day
    microprocessors
  • Instruction scheduling is the throughput limiter
  • Need to effectively handle register renaming,
    issue window and wakeup selector
  • Increase the clocking rate
  • Rethinking circuit design methodologies
  • Modifying architectural design strategies
  • Wanting to have our cake and eat it too?
  • Aim to reduce power consumption as well

4
Approaches to Handle Issue Logic Complexity
  • Performance = IPC × Clock Frequency (see the worked example after this list)
  • Pipelining the scheduling logic reduces IPC
  • Non-pipelined scheduling logic reduces the clock rate
  • Architectural solutions
  • Non-pipelined scheduling with dependence-queue-based issue logic (Complexity-Effective) [1]
  • Pipelined scheduling with speculative wakeup [2]
  • Generic speedup and power conservation using tag elimination [3]
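
A worked example of the tradeoff, using purely hypothetical IPC and frequency numbers, just to show how the product behaves:

    # Hypothetical numbers for illustration only: performance = IPC * clock frequency.
    non_pipelined = {"ipc": 1.8, "freq_ghz": 1.0}  # atomic wakeup-select limits the clock
    pipelined     = {"ipc": 1.6, "freq_ghz": 1.3}  # pipelined scheduler loses IPC, gains clock

    for name, cfg in (("non-pipelined", non_pipelined), ("pipelined", pipelined)):
        perf = cfg["ipc"] * cfg["freq_ghz"]        # billions of instructions per second
        print(f"{name}: {perf:.2f} BIPS")          # the faster clock wins here despite lower IPC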

5
Baseline Superscalar Model
  • The rename and wakeup-select stages of the generic superscalar pipeline model need to be targeted
  • Consider VLSI effects when deciding which design component to redesign

6
Analyzing Baseline Implementations
  • Physical layout implementation of microprocessor
    circuits optimized for speed
  • Usage of dynamic logic for bottleneck circuits
  • Manual sizing of transistors in critical path
  • Logic optimizations like two level decomposition
  • Components analyzed
  • Register rename logic
  • Wakeup Logic / Issue window
  • Selection logic
  • Bypass logic

7
Register Rename Logic
  • RAM vs. CAM
  • Focus on RAM due to scalability
  • Decreasing feature sizes scale down logic delays but not, correspondingly, wire delays
  • Delay grows quadratically with issue width in theory, but is effectively linear over the design space studied
  • Wordline and bitline delays will need attention in future technologies

8
Wakeup Logic
  • CAM is preferred
  • Tag drive times are quadratic functions of window
    size as well as issue width
  • Matching times are quadratic functions of issue
    width only
  • All delays are effectively linear for considered
    design space
  • Need to handle broadcast operation delays in
    future

9
Selection Logic
  • Tree of arbiters (sketched below)
  • Requests flow from the issue window down the tree, and functional-unit grants flow back up to the issue window
  • Necessity of a selection policy (Oldest First / Leftmost First)
  • Delays are proportional to the logarithm of the window size
  • All delays considered are logic delays
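
A minimal functional sketch of a leftmost-first arbiter tree; the structure and naming are my assumptions, not the paper's circuit. In hardware the two subtrees resolve in parallel, which is why the critical path grows with the logarithm of the window size:

    # Functional model of a leftmost-first tree arbiter; in hardware both
    # subtrees are evaluated in parallel, so the critical path is O(log n).
    def select_leftmost(requests):
        """Return the index of the granted request, or None if nothing requests."""
        n = len(requests)
        if n == 1:
            return 0 if requests[0] else None
        mid = n // 2
        left = select_leftmost(requests[:mid])
        if left is not None:                       # leftmost-first policy: prefer left subtree
            return left
        right = select_leftmost(requests[mid:])
        return mid + right if right is not None else None

    print(select_leftmost([0, 0, 1, 0, 1, 0, 0, 0]))   # grants window entry 2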

10
Bypass Logic
  • Number of bypass paths dependent upon pipeline
    depth (linear) and issue width (quadratic)
  • Composed of operand muxes and buffer drivers
  • Delays are quadratically proportional to length
    of result wires and hence issue width
  • Insignificant compared to other delays as feature
    size reduces

11
Complexity Effective Microarchitecture Design
Premises
  • Retain benefits of complex issue schemes but
    enable faster clocking
  • Design assumption: wakeup-select and data bypassing should not be pipelined, as they are atomic operations (required if dependent instructions are to execute in consecutive cycles)

12
Dependence Based Microarchitecture
  • Replace the issue window with FIFOs, each queue holding a chain of dependent instructions
  • Steer instructions to the appropriate FIFO in the rename stage using heuristics (sketched below)
  • SRC_FIFO and reservation tables handle dependencies and wakeup
  • IPC drops slightly, but the clock rate increases, giving a faster implementation overall
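
A minimal sketch of one possible steering heuristic, loosely following [1]; the FIFO class, field names, and the producer_of map are assumptions made for illustration:

    # Steer an instruction behind the FIFO whose tail produces one of its sources,
    # fall back to an empty FIFO, and stall dispatch when neither is possible.
    class Fifo:
        def __init__(self):
            self.entries = []                       # in-flight instructions, oldest first
        def tail(self):
            return self.entries[-1] if self.entries else None

    def steer(instr, fifos, producer_of):
        """producer_of maps a source register to its in-flight producer, if any."""
        producers = {producer_of.get(s) for s in instr["srcs"]}
        for f in fifos:
            if f.tail() is not None and f.tail() in producers:
                f.entries.append(instr)             # chain behind the dependent tail
                return f
        for f in fifos:
            if not f.entries:                       # any empty FIFO will do
                f.entries.append(instr)
                return f
        return None                                 # no suitable FIFO: dispatch stalls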

13
Clustering Dependence Based Microarchitectures
  • Reduce bypass delays by reducing the length of bypass paths
  • Inter-cluster communication must be minimized; it otherwise incurs an extra-cycle penalty
  • Clustered Microarchitecture Types
  • Single Window, Execution Driven Steering
  • Two Windows, Dispatch Driven Steering - Best
  • Two Windows, Random Steering

14
Pipelining Dynamic Instruction Scheduling Logic
  • Wakeup-select was held atomic in the previous implementation
  • Increase performance by pipelining it, while retaining execution of dependent instructions in consecutive cycles
  • Speculate on the wakeup by predicting based on
    both parent and grandparent instructions
  • Integrated into the Tomasulo approach

15
Wakeup Logic Details
  • The tag is broadcast as soon as the instruction begins execution
  • The latency from tag broadcast to execution completion is specified for each instruction
  • A match bit acts as a sticky bit that enables a delay countdown (sketched below)
  • Need not always be correct, due to unexpected stalls
  • Select logic remains as in the previous work
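
A minimal sketch of the sticky match bit plus countdown, assuming per-source entries and a latency value known at broadcast time; all names are illustrative:

    # Each source operand snoops the tag bus; once its producer's tag is seen the
    # match bit sticks and a countdown covers the broadcast-to-completion latency.
    class SourceEntry:
        def __init__(self, tag):
            self.tag = tag
            self.match = False              # sticky: set once the producer's tag matches
            self.countdown = 0

        def snoop(self, broadcast_tag, latency):
            if not self.match and broadcast_tag == self.tag:
                self.match = True
                self.countdown = latency    # cycles until the result is actually available

        def tick(self):                     # called once per cycle
            if self.match and self.countdown > 0:
                self.countdown -= 1

        def ready(self):
            return self.match and self.countdown == 0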

16
Pipelining Rename Logic
  • A child instruction assumes its parent will broadcast its tag in the next cycle IF the grandparent instructions broadcast their tags
  • Speculative wakeup on receiving the grandparent tags, for selection in the next cycle
  • Speculative, since the parent's selection for execution is not guaranteed
  • Modifications to the rename map and dependency analysis logic

17
Wakeup and Select Logic
  • A wakeup request is sent after examining the ready bits for the parents' and grandparents' tags (sketched below)
  • The grandparent field of a multi-cycle parent can be ignored
  • In addition to speculative readiness signalled by the request line, a confirm line is activated when all parents are ready
  • False selections involve non-confirmed requests
  • Problematic only when truly ready instructions are not selected as a result
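
A minimal sketch of the request/confirm distinction described above; treating each parent as either ready or covered by its own ready parents (the grandparents) is my reading of [2], and the function shape is an assumption:

    # request asserts speculatively when each parent is either ready or has all of
    # its own parents (this instruction's grandparents) ready; confirm asserts only
    # when every parent is truly ready.
    def wakeup_signals(parent_ready, grandparents_ready):
        """Both lists are aligned per parent operand."""
        request = all(p or g for p, g in zip(parent_ready, grandparents_ready))
        confirm = all(parent_ready)
        return request, confirm

    print(wakeup_signals([True, False], [True, True]))   # (True, False): speculative only
    print(wakeup_signals([True, True],  [True, True]))   # (True, True): confirmed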

18
Implementation Experimentation Details
  • Usage of a cycle accurate execution driven
    simulator for the Alpha ISA
  • Baseline: conventional 2-cycle scheduling pipeline
  • Budget / Deluxe: speculative-wakeup scheduling
  • Ideal: 1-cycle scheduling pipeline
  • Factors like issue width and reservation station
    depth considered
  • Significant reduction in critical path with minor
    IPC impacts
  • Enables higher clock frequencies, deeper
    pipelines and larger instruction windows for
    better performance

19
Paradigm Shift
  • So far we've added hardware to improve performance
  • However, the issue window can also be improved by removing hardware

20
Current Situation of Issue Windows
  • Content Addressable Memory (CAM) latency
    dominates instruction window latency.
  • Load Capacitance of CAM is a major limiting
    factor for speed.
  • Parasitic capacitance also wastes power
  • Issue logic consumes a large share of the power budget
  • 16% for the Pentium Pro
  • 18% for the Alpha 21264

21
Unnecessary Circuitry
  • Observation: reservation stations compare broadcast tags against both operands; often this is unnecessary
  • Only 25% to 35% of architectural instructions have two operands
  • Simulation of SPEC2K programs shows only 10% to 20% of instructions need two comparators at runtime

22
Simulation
  • Used SimpleScalar
  • Varied instruction window size: 16, 64, 256
  • Load/store queue of half the window size

23
Removing Extra Comparators
  • Specialize the reservation stations
  • The number of comparators varies by station, from 2 down to 0
  • Stall if no station with the required number of comparators is available (allocation sketched below)
  • Remove further comparators by speculating on which operand completes last
  • Needs a predictor
  • Misprediction penalty
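
A minimal sketch of dispatch into specialized stations; the station classes, free-count bookkeeping, and stall policy are assumptions used to illustrate the idea in [3]:

    # An instruction needing k outstanding source tags takes the smallest free
    # station class with at least k comparators; if none is free, dispatch stalls.
    def allocate(needed_tags, free_stations):
        """free_stations maps comparator count (0, 1, 2) to the number of free stations."""
        for comparators in sorted(free_stations):
            if comparators >= needed_tags and free_stations[comparators] > 0:
                free_stations[comparators] -= 1
                return comparators              # station class actually used
        return None                             # stall: no suitable station available

    print(allocate(1, {0: 4, 1: 0, 2: 2}))      # 1-tag stations exhausted, uses a 2-tag one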

24
Predictor
  • The paper discusses a GSHARE predictor
  • It is based on a branch predictor not covered in class
  • The idea starts from noting that two good indexes for selecting binary predictors are
  • the branch address
  • global history
  • Thus, if both are good, XORing them together should produce an index embodying more information than either alone (a sketch follows this list)
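
A minimal sketch of the gshare indexing scheme from [4]; the table size, 2-bit counters, and update policy are standard choices rather than details taken from these slides:

    # gshare: XOR the branch address with the global history to index a table of
    # 2-bit saturating counters; the paper reuses this indexing idea for last-tag
    # prediction.
    class Gshare:
        def __init__(self, index_bits=12):
            self.mask = (1 << index_bits) - 1
            self.table = [1] * (1 << index_bits)     # start weakly "not taken"
            self.history = 0

        def _index(self, pc):
            return (pc ^ self.history) & self.mask   # address XOR global history

        def predict(self, pc):
            return self.table[self._index(pc)] >= 2  # counter in upper half => predict taken

        def update(self, pc, taken):
            i = self._index(pc)
            self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
            self.history = ((self.history << 1) | int(taken)) & self.mask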

25
Predictor II
  • Here is how GSHARE performs for various sizes of the prediction table

26
Misprediction
  • The Alpha has a scoreboard of valid registers called RDY
  • Check whether all operands are available in the register-read stage; if not, flush the pipeline in the same fashion as on a latency misprediction (sketched below)
  • RDY must be expanded so that its number of read ports matches the issue width
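
A minimal sketch of that check, assuming an instruction record with a srcs list and RDY modeled as a set of valid registers; the flush mechanism itself is outside the sketch:

    # In the register-read stage, every source is looked up in the RDY scoreboard;
    # any missing operand means the scheduling speculation was wrong and the
    # instruction (and everything younger) must be flushed and replayed.
    def needs_flush(instr, rdy):
        """rdy is the set of registers whose values are currently valid."""
        return any(src not in rdy for src in instr["srcs"])

    print(needs_flush({"srcs": ["r3", "r7"]}, {"r3", "r5"}))   # True: r7 not ready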

27
IPC Losses
  • Two-comparator reservation stations can be exhausted, causing stalls for SPEC2K benchmarks like SWIM
  • Adding last-tag prediction improves SWIM performance but causes 1-3% losses for benchmarks such as Crafty and GCC due to mispredictions

28
Simulation
  • The format shown is: number of two-tag / one-tag / zero-tag stations
  • The last-tag predictor is used only in configurations with no two-tag reservation stations

29
Benefits of Comparator Removal
  • In most cases the clock rate can be 25-45% faster, since
  • the tag bus no longer has to reach all reservation stations
  • removing comparators removes load capacitance
  • Energy saved from the capacitance removal is 30-60%
  • Power savings don't track the energy savings, since the clock rate can now increase

30
Simulation results for benefits
31
References
  1. Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-Effective Superscalar Processors.
  2. J. Stark, M. D. Brown, and Yale N. Patt. On Pipelining Dynamic Instruction Scheduling Logic.
  3. Dan Ernst and Todd Austin. Efficient Dynamic Scheduling Through Tag Elimination.
  4. Scott McFarling. Combining Branch Predictors.

32
Questions?