Reducing Issue Logic Complexity in Superscalar Microprocessors

Slides: 33

Provided by: Gan950

Learn more at: https://www.engineering.iastate.edu

1
Reducing Issue Logic Complexity in Superscalar
Microprocessors

Survey Project
CprE 585 Advanced Computer Architecture
David Lastine
Ganesh Subramanian

2
Introduction

The ultimate goal of any computer architect:
designing a fast machine
Approaches
Increasing clocking rate (Help from VLSI)
Increasing bus width
Increasing pipeline depth
Superscalar architectures
Tradeoffs between hardware complexity and clock
speed
Given a particular technology, the more complex
the hardware, the lower the achievable clock rate

3
A New Paradigm

Retaining the effective functionality of complex
superscalar processors
Target the bottleneck in present day
microprocessors
Instruction scheduling is the throughput limiter
Need to effectively handle register renaming,
issue window and wakeup selector
Increase the clocking rate
Rethinking circuit design methodologies
Modifying architectural design strategies
Wanting to have the cake and eat it too?
Aim at reducing power consumption too

4
Approaches to Handle Issue Logic Complexity

Performance = IPC × Clock Frequency
Pipelining scheduling logic reduces the IPC
Non-pipelined scheduling logic reduces clocking
rate
Architectural solutions
Non-pipelined scheduling with dependence-queue
based issue logic (Complexity-Effective) [1]
Pipelined scheduling with speculative wakeup [2]
Generic speed up and power conservation using tag
elimination [3]
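
The trade-off between IPC and clock frequency can be made concrete with a small sketch. The numbers below are hypothetical, chosen only to illustrate that a scheduler which loses some IPC can still come out ahead if it allows a faster clock; they are not results from the surveyed papers.

```python
def performance(ipc, clock_ghz):
    """Throughput in instructions per nanosecond: IPC x clock frequency (GHz)."""
    return ipc * clock_ghz


# Hypothetical design points (assumed values, for illustration only).
atomic = performance(ipc=2.0, clock_ghz=1.0)      # non-pipelined scheduling: higher IPC, slower clock
pipelined = performance(ipc=1.8, clock_ghz=1.25)  # pipelined scheduling: lower IPC, faster clock

print(f"atomic scheduler:    {atomic:.2f} instr/ns")
print(f"pipelined scheduler: {pipelined:.2f} instr/ns")
```

With these assumed values the pipelined design wins, but a larger IPC loss or a smaller clock gain would reverse the outcome.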

5
Baseline Superscalar Model

The rename and the wake-up select stages of the
generic superscalar pipeline model need to be
targeted
Consider VLSI effects and decide to redesign a
particular design component

6
Analyzing Baseline Implementations

Physical layout implementation of microprocessor
circuits optimized for speed
Usage of dynamic logic for bottleneck circuits
Manual sizing of transistors in critical path
Logic optimizations like two level decomposition
Components analyzed
Register rename logic
Wakeup Logic / Issue window
Selection logic
Bypass logic

7
Register Rename Logic

RAM vs. CAM
Focus on RAM due to scalability
Decreasing feature sizes do not correspondingly
scale down wire delays, but only logic delays
Delay is a quadratic function of issue width, but
effectively linear over the design space considered
Need to handle wordline and bitline delays in
future

8
Wakeup Logic

CAM is preferred
Tag drive times are quadratic functions of window
size as well as issue width
Matching times are quadratic functions of issue
width only
All delays are effectively linear for considered
design space
Need to handle broadcast operation delays in
future
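
A minimal behavioral sketch of CAM-style wakeup, assuming each window entry holds source tags with one ready bit per operand. The class and function names are hypothetical. It models only the comparisons, not the circuit delays: every broadcast tag is compared against every operand of every entry, which is why the work grows with both window size and issue width.

```python
from dataclasses import dataclass, field


@dataclass
class WindowEntry:
    name: str
    src_tags: list                      # tags of the source operands
    ready: list = field(default_factory=list)

    def __post_init__(self):
        # One ready bit per source operand, initially not ready.
        if not self.ready:
            self.ready = [False] * len(self.src_tags)

    def all_ready(self):
        return all(self.ready)


def broadcast(window, result_tags):
    """Compare each broadcast tag with each operand tag; return the comparison count."""
    comparisons = 0
    for entry in window:
        for i, tag in enumerate(entry.src_tags):
            for result in result_tags:
                comparisons += 1
                if tag == result:
                    entry.ready[i] = True
    return comparisons


window = [WindowEntry("add", [1, 2]), WindowEntry("mul", [3, 4])]
count = broadcast(window, result_tags=[1, 2])   # two results broadcast in one cycle
print(count, [e.name for e in window if e.all_ready()])
```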

9
Selection Logic

Tree of arbiters
Requests flow down while functional unit grants
flow up to the issue window
Necessity of a selection policy (Oldest First /
Leftmost First)
Delays proportional to the logarithm of the
window size
All delays considered are logic delays
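
The logarithmic depth of the arbiter tree can be sketched as follows. This is an illustrative model, not the circuit from the paper: it assumes a leftmost-first policy (lowest index wins) and 4-input arbiter cells, and the function name is hypothetical.

```python
def select(requests, arity=4):
    """Pick one requesting entry through a tree of arbiters.

    requests: list of booleans, one per issue-window entry.
    Returns (index of the granted entry or None, number of arbiter levels).
    """
    if not requests:
        return None, 0
    # Each node carries the index of the leftmost requester in its subtree.
    level = [i if r else None for i, r in enumerate(requests)]
    depth = 0
    while len(level) > 1:
        merged = []
        for start in range(0, len(level), arity):
            group = level[start:start + arity]
            # Leftmost-first: the first requesting input of the cell wins.
            merged.append(next((g for g in group if g is not None), None))
        level = merged
        depth += 1
    return level[0], depth


reqs = [False] * 16
reqs[5] = reqs[9] = True
print(select(reqs))          # entry 5 is granted after 2 levels
print(select([False] * 64))  # a 64-entry window needs 3 levels
```

Quadrupling the window size adds one arbiter level here, which matches the logarithmic delay relation stated above.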

10
Bypass Logic

Number of bypass paths dependent upon pipeline
depth (linear) and issue width (quadratic)
Composed of operand muxes and buffer drivers
Delays are quadratically proportional to length
of result wires and hence issue width
Insignificant compared to other delays as feature
size reduces

11
Complexity Effective Microarchitecture Design
Premises

Retain benefits of complex issue schemes but
enable faster clocking
Design assumption: wakeup + select and data
bypassing should not be pipelined, as these are
atomic operations (if dependent instructions are
to execute in consecutive cycles)

12
Dependence Based Microarchitecture

Replace Issue Window by FIFOs with each queue
composed of dependent instructions
Steer instructions to the appropriate FIFO in
rename stage using heuristics
SRC_FIFO and reservation tables to handle
dependencies and wakeup
IPC reduces but clocking rate increases to give a
faster implementation
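
One way to picture the steering step is the simplified sketch below. It assumes a single heuristic: place an instruction behind its producer if that producer sits at the tail of a FIFO, otherwise start a new chain in an empty FIFO. The `steer` function and the `src_fifo` dictionary are illustrative stand-ins for the steering logic and the SRC_FIFO table. Issue from the FIFO heads is not modeled, and instructions that cannot be placed are simply collected, whereas real hardware would stall dispatch.

```python
from collections import deque


def steer(instructions, num_fifos=4):
    """Steer (dest, sources) tuples into FIFOs of dependent instructions.

    Returns (fifos, unplaced), where unplaced lists instructions that found no FIFO.
    """
    fifos = [deque() for _ in range(num_fifos)]
    src_fifo = {}    # register -> index of the FIFO holding its producer
    unplaced = []
    for dest, sources in instructions:
        target = None
        # Prefer a FIFO whose tail instruction produces one of our sources.
        for reg in sources:
            f = src_fifo.get(reg)
            if f is not None and fifos[f] and fifos[f][-1][0] == reg:
                target = f
                break
        if target is None:
            # Otherwise start a new dependence chain in an empty FIFO.
            target = next((i for i, q in enumerate(fifos) if not q), None)
        if target is None:
            unplaced.append((dest, sources))
            continue
        fifos[target].append((dest, sources))
        src_fifo[dest] = target
    return fifos, unplaced


program = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r2", "r3"])]
fifos, unplaced = steer(program)
for i, q in enumerate(fifos):
    print(i, [dest for dest, _ in q])
```

Because each FIFO holds a chain of dependent instructions, only the FIFO heads need to be checked for readiness.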

13
Clustering Dependence Based Microarchitectures

Reducing bypass delays by reducing length of
bypass paths
Minimization of inter-cluster communication,
extra cycle penalty otherwise
Clustered Microarchitecture Types
Single Window, Execution Driven Steering
Two Windows, Dispatch Driven Steering - Best
Two Windows, Random Steering

14
Pipelining Dynamic Instruction Scheduling Logic

Wakeup + Select was held atomic in previous
implementation
Increase performance by pipelining it, but retain
execution of dependent instruction in consecutive
cycles
Speculate on the wakeup by predicting based on
both parent and grandparent instructions
Integrated into the Tomasulo approach

15
Wakeup Logic Details

Tag broadcast as soon as instruction begins
execution
Latency from tag broadcast to execution completion
specified as shown
Match bit acts as the sticky bit to enable delay
countdown
Need not always be correct due to unexpected
stalls
Select logic remains as in previous work

16
Pipelining Rename Logic

Assumption by child instruction that parent would
broadcast its tag in the next cycle, if the
grandparent instructions broadcast their tags
Speculative wakeup on receiving the grandparent
tags, for selection in the next cycle
Speculative since parent selection for execution
is not guaranteed
Modifications in rename map and dependency
analysis logic

17
Wakeup and Select Logic

Wakeup request sent after looking into ready bits
from the parents' and grandparents' tags
A multi-cycle parents field can be ignored
In addition to speculative readiness signified by
request line, a confirm line is activated when
all parents are ready
False selections involve non-confirmed requests
Problematic only when really ready instructions
are not selected
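
The request and confirm lines can be sketched as two boolean functions of the ready bits. This is a simplified reading of the scheme described above, with a hypothetical data layout: a parent counts as speculatively ready when all of its own parents (the grandparents) are ready.

```python
def wakeup_lines(parents):
    """Compute (request, confirm) for one instruction.

    parents: list of dicts of the form
             {"ready": bool, "grandparents_ready": [bool, ...]}
    request: every parent is ready, or is expected to broadcast next cycle
             because all of its grandparents are ready (speculative).
    confirm: every parent is actually ready (non-speculative).
    """
    confirm = all(p["ready"] for p in parents)
    request = all(p["ready"] or all(p["grandparents_ready"]) for p in parents)
    return request, confirm


# Parent not yet ready, but both grandparents are: speculative request only.
print(wakeup_lines([{"ready": False, "grandparents_ready": [True, True]}]))
# Parent ready: request and confirm are both raised.
print(wakeup_lines([{"ready": True, "grandparents_ready": []}]))
```

A selection made while `request` is set but `confirm` is not is a false selection in the sense used above.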

18
Implementation Experimentation Details

Usage of a cycle accurate execution driven
simulator for the Alpha ISA
Baseline: conventionally scheduled (2) pipeline
Budget / Deluxe: speculatively woken up
scheduling
Ideal: 1 cycle scheduling pipeline
Factors like issue width and reservation station
depth considered
Significant reduction in critical path with minor
IPC impacts
Enables higher clock frequencies, deeper
pipelines and larger instruction windows for
better performance

19
Paradigm shift

So far we've added hardware to improve
performance
However, the issue window could also be improved
by removing hardware

20
Current Situation of Issue Windows

Content Addressable Memory (CAM) latency
dominates instruction window latency.
Load Capacitance of CAM is a major limiting
factor for speed.
Parasitic capacitance also wastes power.
Issue logic uses a lot of the power budget
16% for the Pentium Pro
18% for the Alpha 21264

21
Unnecessary Circuitry

Observation: reservation stations compare
broadcast tags to both operands. Often, this is
unnecessary.
Only 25% to 35% of architectural instructions
have two operands.
Simulation of SPEC2000 programs shows only 10% to
20% of instructions need two comparators during
runtime.

22
Simulation

Used SimpleScalar
Varied instruction window size: 16, 64, 256.
Load/Store queue of half window size.

23
Removing extra comparators

Specialize the reservation stations.
Number of comparators varies by station from 2 to
0.
Stall if no station with minimum comparator
available
Remove some operands by speculating on last
operand to complete.
Needs predictor
Misprediction penalty
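
Allocation among specialized stations can be sketched as below. This is an illustration under assumptions: the instruction's need is taken to be its number of not-yet-ready operands at dispatch, and the allocator picks the free station with the fewest sufficient comparators. The function name and the pool layout are hypothetical.

```python
def allocate(free_stations, unready_operands):
    """Pick a free reservation station with enough comparators.

    free_stations: dict mapping comparator count (0, 1 or 2) to the number
                   of free stations of that kind; updated in place.
    Returns the comparator count of the chosen station, or None to signal a stall.
    """
    for comparators in sorted(free_stations):
        if comparators >= unready_operands and free_stations[comparators] > 0:
            free_stations[comparators] -= 1
            return comparators
    return None  # no suitable station is free: dispatch must stall


pool = {0: 1, 1: 1, 2: 1}
print(allocate(pool, 1))  # uses the one-comparator station
print(allocate(pool, 1))  # falls back to the two-comparator station
print(allocate(pool, 2))  # None: two-comparator stations are exhausted
```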

24
Predictor

The paper discusses a GSHARE predictor
It's based on a branch predictor not seen in class.
Idea behind it starts by noting good indexes for
selecting binary predictors are
Branch address
Global history
Thus if both are good, XORing them together
should produce an index embodying more
information than either alone.
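
A minimal sketch of the gshare indexing idea, assuming a table of 2-bit saturating counters and a global history register as wide as the table index. The table size and initial counter value are arbitrary choices, and the binary outcome stands for whatever is being predicted (taken or not taken for branches; for last-tag prediction it could encode which operand arrives last).

```python
class Gshare:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # 2-bit counters, start at "weakly false"
        self.history = 0                      # global history of recent outcomes

    def _index(self, pc):
        # XOR the instruction address with the global history.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, outcome):
        i = self._index(pc)
        if outcome:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the new outcome into the history register.
        self.history = ((self.history << 1) | int(outcome)) & self.mask


g = Gshare()
for _ in range(20):
    g.update(0x40, True)
print(g.predict(0x40))
```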

25
Predictor II

Here is how GSHARE does for various sizes of the
prediction table.

26
Misprediction

Alpha has scoreboard of valid registers called
RDY.
Check if all operands available in register read
stage, if not flush pipeline in the same fashion
as latency misprediction.
RDY must be expanded to have the number of read
ports match the issue width.
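
The check in the register read stage can be sketched as follows, modeling the RDY scoreboard as a set of registers with valid values. The function name and data layout are hypothetical. Each issued instruction looks up its sources in the same cycle, which is why the scoreboard needs read ports in proportion to the issue width.

```python
def verify_issue_group(rdy, issue_group):
    """Return the names of issued instructions whose operands are not all valid.

    rdy:         set of registers whose values are currently valid
    issue_group: list of (name, source_registers) issued in the same cycle
    A non-empty result means a last-tag misprediction occurred and the
    pipeline must be flushed and the instructions replayed.
    """
    return [name for name, sources in issue_group
            if not all(reg in rdy for reg in sources)]


rdy = {"r1", "r2"}
group = [("add", ["r1", "r2"]), ("mul", ["r1", "r3"])]
print(verify_issue_group(rdy, group))  # "mul" issued before r3 was valid
```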

27
IPC losses

Reservation stations with two ports can be
exhausted. Causes stalls for SPEC2000 benchmarks
like SWIM
Adding last tag prediction improves SWIM
performance but causes 1-3% losses for benchmarks
such as Crafty and Gcc due to misprediction

28
Simulation