Title: Reducing Issue Logic Complexity in Superscalar Microprocessors
1Reducing Issue Logic Complexity in Superscalar
Microprocessors
- Survey Project
- CprE 585 Advanced Computer Architecture
- David Lastine
- Ganesh Subramanian
2Introduction
- The ultimate goal of any computer architect: designing a fast machine
- Approaches
- Increasing clocking rate (Help from VLSI)
- Increasing bus width
- Increasing pipeline depth
- Superscalar architectures
- Tradeoffs between hardware complexity and clock speed
- Given a particular technology, the more complex the hardware, the lower the clock rate
3A New Paradigm
- Retaining the effective functionality of complex superscalar processors
- Target the bottleneck in present-day microprocessors
- Instruction scheduling is the throughput limiter
- Need to effectively handle register renaming, the issue window, and wakeup/select logic
- Increase the clock rate
- Rethinking circuit design methodologies
- Modifying architectural design strategies
- Wanting to have the cake and eat it too?
- Aim at reducing power consumption too
4Approaches to Handle Issue Logic Complexity
- Performance = IPC × Clock Frequency
- Pipelining scheduling logic reduces the IPC
- Non-pipelined scheduling logic reduces the clock rate
- Architectural solutions
- Non-pipelined scheduling with dependence-queue-based issue logic: Complexity-Effective [1]
- Pipelined scheduling with speculative wakeup [2]
- Generic speedup and power conservation using tag elimination [3]
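The tradeoff in the first bullets can be made concrete with a small calculation. This is only a sketch: the IPC and clock values below are hypothetical numbers picked for illustration, not measurements from any of the surveyed papers.

```python
def performance(ipc: float, clock_ghz: float) -> float:
    """Instructions per nanosecond = IPC x clock frequency (in GHz)."""
    return ipc * clock_ghz

# Hypothetical design points (assumed values, for illustration only).
atomic = performance(ipc=2.0, clock_ghz=1.0)     # non-pipelined scheduling: higher IPC, slower clock
pipelined = performance(ipc=1.7, clock_ghz=1.3)  # pipelined scheduling: lower IPC, faster clock

speedup = pipelined / atomic
print(f"atomic={atomic:.2f} pipelined={pipelined:.2f} speedup={speedup:.3f}")
```

With these assumed numbers the pipelined design wins because the clock gain outweighs the IPC loss; the opposite outcome is just as possible with other values.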
5Baseline Superscalar Model
- The rename and the wakeup/select stages of the generic superscalar pipeline model need to be targeted
- Consider VLSI effects and decide to redesign a particular design component
6Analyzing Baseline Implementations
- Physical layout implementation of microprocessor circuits optimized for speed
- Usage of dynamic logic for bottleneck circuits
- Manual sizing of transistors in the critical path
- Logic optimizations like two-level decomposition
- Components analyzed
- Register rename logic
- Wakeup Logic / Issue window
- Selection logic
- Bypass logic
7Register Rename Logic
- RAM vs. CAM
- Focus on RAM due to scalability
- Decreasing feature sizes scale down logic delays, but do not correspondingly scale down wire delays
- Delay is a quadratic function of issue width, but effectively linear in practice
- Need to handle wordline and bitline delays in the future
8Wakeup Logic
- CAM is preferred
- Tag drive times are quadratic functions of window size as well as issue width
- Tag match times are quadratic functions of issue width only
- All delays are effectively linear for the design space considered
- Need to handle broadcast operation delays in the future
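To illustrate why the CAM-style wakeup grows with both window size and issue width, here is a minimal behavioral sketch. The class and function names (`WindowEntry`, `broadcast`) are hypothetical, and the code models only the tag comparison, not circuit delay.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WindowEntry:
    name: str
    src_tags: List[Optional[int]]  # up to two source tags; None means no operand
    ready: List[bool]              # one ready bit per operand

    def is_ready(self) -> bool:
        return all(self.ready)

def broadcast(window: List[WindowEntry], result_tags: List[int]) -> int:
    """Compare every broadcast tag with every operand tag; return the comparison count."""
    comparisons = 0
    for entry in window:
        for i, tag in enumerate(entry.src_tags):
            for result in result_tags:
                comparisons += 1  # one comparator per (operand, result bus) pair
                if tag is not None and tag == result:
                    entry.ready[i] = True
    return comparisons

window = [
    WindowEntry("add", [5, 7], [False, False]),
    WindowEntry("sub", [7, None], [False, True]),
    WindowEntry("mul", [9, 5], [False, False]),
]
count = broadcast(window, result_tags=[5, 7])  # two results broadcast in one cycle
print(count, [e.name for e in window if e.is_ready()])
```

The comparison count is window size × 2 operands × issue width, which is the structural reason the tag lines get longer and more heavily loaded as either parameter grows.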
9Selection Logic
- Tree of arbiters
- Requests flow from the issue window toward the root arbiter, while functional unit grants flow back to the issue window
- Necessity of a selection policy (Oldest First / Leftmost First)
- Delays proportional to the logarithm of the window size
- All delays considered are logic delays
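The arbiter tree and its logarithmic depth can be sketched as follows. This is a behavioral model under assumptions: a fan-in of 4 per arbiter cell and a leftmost-first policy are chosen here for illustration, and the function names are hypothetical.

```python
def tree_levels(window_size: int, fan_in: int = 4) -> int:
    """Number of arbiter levels needed to cover the window (grows as log of window size)."""
    levels, span = 0, 1
    while span < window_size:
        span *= fan_in
        levels += 1
    return max(levels, 1)

def arbitrate(requests, lo, hi, fan_in=4):
    """Arbiter cell covering requests[lo:hi]; grants the leftmost requester, or None."""
    if hi - lo <= 1:
        return lo if requests[lo] else None
    step = -(-(hi - lo) // fan_in)  # ceiling division: entries handled by each child cell
    for child_lo in range(lo, hi, step):
        grant = arbitrate(requests, child_lo, min(child_lo + step, hi), fan_in)
        if grant is not None:
            return grant  # leftmost child with a pending request wins
    return None

requests = [False] * 16
requests[5] = requests[11] = True
print(arbitrate(requests, 0, len(requests)), tree_levels(16), tree_levels(64), tree_levels(256))
```

An oldest-first policy would need the window ordered by age (or extra age logic); leftmost-first is used here only because it is the simplest to model.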
10Bypass Logic
- Number of bypass paths depends on pipeline depth (linear) and issue width (quadratic)
- Composed of operand muxes and buffer drivers
- Delays are quadratically proportional to the length of the result wires and hence to issue width
- Increasingly significant compared to other delays as feature size shrinks, since wire delays do not scale down with logic delays
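The linear and quadratic dependence of the bypass path count can be checked with a one-line model. The formula below is one common way of counting, assuming two-input functional units and a fully bypassed design; the function and parameter names are hypothetical.

```python
def bypass_paths(issue_width: int, stages_after_result: int, operands_per_unit: int = 2) -> int:
    """Each of IW results per bypassed stage can feed each operand input of each of IW units."""
    return operands_per_unit * issue_width ** 2 * stages_after_result

for iw in (2, 4, 8):
    print(iw, bypass_paths(iw, stages_after_result=1))
```

Doubling the issue width quadruples the path count, while adding a bypassed pipeline stage only adds a proportional amount.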
11Complexity Effective Microarchitecture Design
Premises
- Retain benefits of complex issue schemes but enable faster clocking
- Design assumption: wakeup + select and data bypassing should not be pipelined, as these are atomic operations (if dependent instructions are to execute in consecutive cycles)
12Dependence Based Microarchitecture
- Replace the issue window with FIFOs, with each queue composed of dependent instructions
- Steer instructions to the appropriate FIFO in the rename stage using heuristics
- SRC_FIFO and reservation tables to handle dependencies and wakeup
- IPC reduces but clock rate increases to give a faster implementation
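A simplified version of the steering step might look like the sketch below. It is an assumption-laden illustration, not the paper's exact heuristic: instructions are identified by their destination register, an instruction joins a FIFO only if its producer is currently at that FIFO's tail, and issue from the FIFO heads is not modeled. `DependenceQueues` and `steer` are hypothetical names; `src_fifo` plays the role of the SRC_FIFO table.

```python
from collections import deque

class DependenceQueues:
    def __init__(self, num_fifos: int):
        self.fifos = [deque() for _ in range(num_fifos)]
        self.src_fifo = {}  # destination register -> index of the FIFO holding its producer

    def steer(self, dest: str, sources: list):
        """Return the FIFO index chosen for the instruction, or None to stall dispatch."""
        # Prefer a FIFO whose tail instruction produces one of our sources.
        for src in sources:
            idx = self.src_fifo.get(src)
            if idx is not None and self.fifos[idx] and self.fifos[idx][-1] == src:
                return self._push(idx, dest)
        # Otherwise start a new dependence chain in an empty FIFO.
        for idx, fifo in enumerate(self.fifos):
            if not fifo:
                return self._push(idx, dest)
        return None  # no suitable FIFO available: stall

    def _push(self, idx: int, dest: str) -> int:
        self.fifos[idx].append(dest)
        self.src_fifo[dest] = idx
        return idx

q = DependenceQueues(num_fifos=2)
placements = [
    q.steer("r1", []),      # independent: takes an empty FIFO
    q.steer("r2", ["r1"]),  # producer r1 is at a tail: same FIFO
    q.steer("r3", ["r1"]),  # r1 is no longer at the tail: new FIFO
    q.steer("r4", ["r3"]),  # follows r3
    q.steer("r5", []),      # no empty FIFO left: stall
]
print(placements)
```

Because only FIFO heads need to be checked for readiness, the wakeup logic no longer has to search the whole window, which is where the clock-rate gain comes from.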
13Clustering Dependence Based Microarchitectures
- Reducing bypass delays by reducing the length of bypass paths
- Minimization of inter-cluster communication; extra cycle penalty otherwise
- Clustered microarchitecture types
- Single Window, Execution Driven Steering
- Two Windows, Dispatch Driven Steering (best)
- Two Windows, Random Steering
14Pipelining Dynamic Instruction Scheduling Logic
- Wakeup + select was held atomic in the previous implementation
- Increase performance by pipelining it, but retain execution of dependent instructions in consecutive cycles
- Speculate on the wakeup by predicting based on both parent and grandparent instructions
- Integrated into the Tomasulo approach
15Wakeup Logic Details
- Tag is broadcast as soon as the instruction begins execution
- Latency from tag broadcast to execution completion is specified, as shown
- Match bit acts as a sticky bit to enable the delay countdown
- Need not always be correct due to unexpected stalls
- Select logic remains as in previous work
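The match bit and delay countdown can be modeled per operand as in the sketch below. The exact cycle at which the countdown starts is an assumption here, and the `Operand` class is a hypothetical name used only for illustration.

```python
class Operand:
    def __init__(self, tag: int):
        self.tag = tag
        self.match = False  # sticky: stays set once the producer's tag has been seen
        self.delay = 0      # cycles remaining until the producer's result is available

    def on_broadcast(self, tag: int, latency: int) -> None:
        """Producer broadcasts its tag at the start of execution, along with its latency."""
        if not self.match and tag == self.tag:
            self.match = True
            self.delay = latency

    def tick(self) -> None:
        """Advance one cycle; count down only after the tag has matched."""
        if self.match and self.delay > 0:
            self.delay -= 1

    @property
    def ready(self) -> bool:
        return self.match and self.delay == 0

op = Operand(tag=7)
op.on_broadcast(tag=3, latency=2)  # unrelated tag: ignored
op.on_broadcast(tag=7, latency=3)  # producer starts a 3-cycle operation
history = []
for _ in range(3):
    op.tick()
    history.append(op.ready)
print(history)
```

If the producer stalls unexpectedly, the countdown still reaches zero on schedule, which is why the ready indication is not always correct.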
16Pipelining Rename Logic
- The child instruction assumes that its parent will broadcast its tag in the next cycle IF the grandparent instructions broadcast their tags
- Speculative wakeup on receiving the grandparent tags, for selection in the next cycle
- Speculative since parent selection for execution is not guaranteed
- Modifications in rename map and dependency analysis logic
17Wakeup and Select Logic
- Wakeup request sent after looking into ready bits from the parents' and grandparents' tags
- A multi-cycle parent's field can be ignored
- In addition to speculative readiness signified by the request line, a confirm line is activated when all parents are ready
- False selection involves non-confirmed requests
- Problematic only when really ready instructions are not selected
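The request and confirm lines can be expressed as two boolean functions of the parent and grandparent ready bits. This is a sketch under assumptions: the dictionary layout and the name `wakeup_signals` are hypothetical, and a parent with no outstanding grandparents is treated as speculatively selectable.

```python
def wakeup_signals(parents):
    """parents: list of {"ready": bool, "grandparents_ready": [bool, ...]}.

    Returns (request, confirm):
      request - speculative readiness: each parent is ready, or all of its own parents are
      confirm - true readiness: every parent is ready
    """
    confirm = all(p["ready"] for p in parents)
    request = all(p["ready"] or all(p["grandparents_ready"]) for p in parents)
    return request, confirm

speculative = wakeup_signals([{"ready": False, "grandparents_ready": [True, True]}])
confirmed = wakeup_signals([{"ready": True, "grandparents_ready": [True, True]}])
waiting = wakeup_signals([{"ready": False, "grandparents_ready": [True, False]}])
print(speculative, confirmed, waiting)
```

A request without a confirm corresponds to the false-selection case: the instruction may be selected, but the selection is discarded if the parent was not actually selected in the previous cycle.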
18Implementation Experimentation Details
- Usage of a cycle-accurate, execution-driven simulator for the Alpha ISA
- Baseline: conventionally scheduled (2) pipeline
- Budget / Deluxe: speculatively woken-up scheduling
- Ideal: 1-cycle scheduling pipeline
- Factors like issue width and reservation station depth considered
- Significant reduction in critical path with minor IPC impacts
- Enables higher clock frequencies, deeper pipelines, and larger instruction windows for better performance
19Paradigm shift
- So far we've added hardware to improve performance
- However, the issue window could also be improved by removing hardware
20Current Situation of Issue Windows
- Content Addressable Memory (CAM) latency dominates instruction window latency.
- Load capacitance of the CAM is a major limiting factor for speed.
- Parasitic capacitance also wastes power.
- Issue logic uses a lot of the power budget
- 16% for the Pentium Pro
- 18% for the Alpha 21264
21Unnecessary Circuitry
- Observation: reservation stations compare broadcast tags to both operands. Often, this is unnecessary.
- Only 25% to 35% of architectural instructions have two operands.
- Simulation of SPEC2000 programs shows only 10% to 20% of instructions need two comparators during runtime.
22Simulation
- Used SimpleScalar
- Varied instruction window size: 16, 64, 256.
- Load/Store queue of half window size.
23Removing extra comparators
- Specialize the reservation stations.
- Number of comparators varies by station from 2 to 0.
- Stall if no station with the minimum required number of comparators is available
- Remove some operands by speculating on the last operand to complete.
- Needs a predictor
- Misprediction penalty
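Allocation among specialized reservation stations can be sketched as below. The policy of picking the smallest adequate station is an assumption made for this illustration, and `allocate` is a hypothetical name.

```python
def allocate(free: dict, unready_operands: int):
    """free maps comparator count (0, 1, 2) to the number of free stations of that kind.

    Returns the comparator count of the station chosen, or None if dispatch must stall.
    """
    for comparators in sorted(free):
        # Use the smallest station that still has enough comparators.
        if comparators >= unready_operands and free[comparators] > 0:
            free[comparators] -= 1
            return comparators
    return None

free = {2: 1, 1: 1, 0: 1}
results = [
    allocate(free, 0),  # all operands ready: zero-tag station
    allocate(free, 0),  # zero-tag pool exhausted: falls back to a one-tag station
    allocate(free, 2),  # two unready operands: needs the two-tag station
    allocate(free, 1),  # nothing adequate left: stall
]
print(results)
```

With last-tag prediction, an instruction with two unready operands could instead be placed in a one-tag station that watches only the operand predicted to arrive last.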
24Predictor
- Paper discusses the GSHARE predictor
- It's based on a branch predictor not seen in class.
- The idea behind it starts by noting that good indexes for selecting binary predictors are
- Branch address
- Global history
- Thus, if both are good, XORing them together should produce an index embodying more information than either alone.
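The XOR indexing idea can be sketched with a small table of 2-bit saturating counters. The table size, initial counter value, and class name are assumptions for illustration, not the configuration used in the paper; for last-tag prediction the binary outcome would be which operand arrives last rather than taken or not taken.

```python
class Gshare:
    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # 2-bit counters, starting at "weakly 0"
        self.history = 0                      # global history of recent binary outcomes

    def index(self, address: int) -> int:
        # XOR the address with the global history, then keep the low index bits.
        return (address ^ self.history) & self.mask

    def predict(self, address: int) -> bool:
        return self.table[self.index(address)] >= 2

    def update(self, address: int, outcome: bool) -> None:
        i = self.index(address)
        if outcome:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(outcome)) & self.mask

g = Gshare(index_bits=4)
g.history = 0b0110
idx = g.index(0b1010)
g.history = 0
for _ in range(8):
    g.update(5, True)  # same address, always the same outcome
print(bin(idx), g.predict(5))
```

Two instructions at different addresses, or the same instruction reached through different histories, usually land on different counters, which is the extra information the XOR is meant to capture.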
25Predictor II
- Here is how GSHARE does for various sizes of the
prediction table.
26Misprediction
- Alpha has a scoreboard of valid registers called RDY.
- Check if all operands are available in the register read stage; if not, flush the pipeline in the same fashion as a latency misprediction.
- RDY must be expanded so that the number of read ports matches the issue width.
27IPC losses
- Reservation stations with two ports can be exhausted. This causes stalls for SPEC2000 benchmarks like Swim
- Adding last-tag prediction improves Swim performance but causes 1-3% losses for benchmarks such as Crafty and Gcc due to misprediction
28Simulation
- Format shown is the number of two-tag / one-tag / zero-tag reservation stations
- Last-tag predictor used only on configurations with no two-tag reservation stations.
29Benefits of comparator removal
- In most cases the clock rate can be 25-45% faster since
- The tag bus no longer must reach all reservation stations
- Removing comparators removes load capacitance
- Energy saved from capacitance removal is 30-60%
- Power savings don't track energy savings since the clock rate can now increase.
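The last bullet follows from power being energy per cycle times clock frequency. The sketch below works this through under a simplifying assumption that power scales linearly with clock rate; pairing the endpoints of the two ranges is arbitrary and only for illustration.

```python
def relative_power(energy_saving: float, clock_gain: float) -> float:
    """New power / old power, assuming power = energy per cycle x clock frequency."""
    return (1.0 - energy_saving) * (1.0 + clock_gain)

worst = relative_power(energy_saving=0.30, clock_gain=0.45)  # least energy saved, largest speedup
best = relative_power(energy_saving=0.60, clock_gain=0.25)   # most energy saved, smallest speedup
print(f"worst={worst:.3f} best={best:.3f}")
```

In the first pairing the power is essentially unchanged even though each cycle uses 30% less energy, which is why the energy and power figures are reported separately.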
30Simulation results for benefits
31References
- [1] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-Effective Superscalar Processors.
- [2] J. Stark, M. D. Brown, and Yale N. Patt. On Pipelining Dynamic Instruction Scheduling Logic.
- [3] Dan Ernst and Todd Austin. Efficient Dynamic Scheduling Through Tag Elimination.
- [4] Scott McFarling. Combining Branch Predictors.
32Questions?