The anatomy of a modern superscalar processor - PowerPoint PPT Presentation

1
The anatomy of a modern superscalar processor
  • Constantinos Kourouyiannis
  • Madhava Rao Andagunda

2
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

3
Introduction
  • Superscalar processing is the ability to initiate
    multiple instructions during the same cycle.
  • It aims at producing ever faster microprocessors.
  • A typical superscalar processor fetches and
    decodes several instructions at a time.
    Instructions are executed in parallel based on
    the availability of operand data rather than
    their original program sequence. Upon completion,
    instructions are re-sequenced so that they can
    update the process state in the correct program
    order.

4
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

5
Microarchitecture
  • Instruction Fetch and Branch Prediction
  • Decode and Register Dependence Analysis
  • Issue and Execution
  • Memory Operation Analysis and Execution
  • Instruction Reorder and Commit

6
Organization of a superscalar processor
7
Instruction Fetch and Branch Prediction
  • The fetch phase supplies instructions to the rest
    of the processing pipeline.
  • An instruction cache is used to reduce the
    latency and increase the bandwidth of the
    instruction fetch process.
  • The program counter (PC) is used to search the
    cache and determine whether the instruction being
    addressed is present in one of the cache lines.
  • In a superscalar implementation, the fetch phase
    fetches multiple instructions per cycle from
    cache memory.

8
Branch Instructions
  • Recognizing conditional branches
  • - Decode information (extra bits) is held in the
      instruction cache with every instruction
  • Determining the branch outcome
  • - Branch prediction using information about the
      past history of branch outcomes
  • Computing the branch target
  • - Usually an integer addition (PC + offset value)
  • - A Branch Target Buffer holds the target address
      used the last time the branch was executed
  • Transferring control
  • - If the branch is taken → at least one clock
      cycle of delay to recognize the branch, modify
      the PC, and fetch instructions from the target
      address
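The "past history" prediction mentioned above can be illustrated with the classic 2-bit saturating-counter scheme. This is a generic textbook sketch (the table size and PC indexing are illustrative assumptions), not the 21264's actual predictor:

```python
# Sketch: a 2-bit saturating-counter branch predictor.
# Counters 0-1 predict not-taken, 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries    # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)  # drop byte offset, wrap

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]   # a mostly-taken branch
hits = 0
for taken in outcomes:
    hits += (p.predict(0x4000) == taken)
    p.update(0x4000, taken)
# The counter saturates toward taken, so one not-taken outcome
# does not flip the prediction for the following iterations.
```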

9
Instruction Decode
  • Instructions are removed from fetch buffers,
    examined and data dependence linkages are set up.
  • Data dependences
  • True dependences can cause a read after write
    (RAW) hazard
  • Artificial dependences can cause write after
    read (WAR) and write after write (WAW) hazards.
  • Hazards
  • RAW occurs when a consuming instruction reads a
    value before the producing instruction writes it.
  • WAR occurs when an instruction writes a value
    before a preceding instruction reads it.
  • WAW occurs when multiple instructions update the
    same storage location but not in the proper
    order.
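The three hazard classes above can be checked mechanically. A small sketch, assuming each instruction is represented as a destination register plus a tuple of source registers (this representation is an illustration, not a standard format):

```python
# Sketch: classify the data hazards between two instructions
# that appear in program order (earlier before later).

def hazards(earlier, later):
    """Each instruction is (dest_reg, (src_regs...)); dest may be None."""
    found = set()
    if earlier[0] is not None and earlier[0] in later[1]:
        found.add("RAW")            # later reads what earlier writes
    if later[0] is not None and later[0] in earlier[1]:
        found.add("WAR")            # later writes what earlier reads
    if earlier[0] is not None and earlier[0] == later[0]:
        found.add("WAW")            # both write the same register
    return found

# ADD r1, r2, r3  then  SUB r4, r1, r5 : SUB reads r1 -> RAW
print(hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r5"))))  # {'RAW'}
```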

10
Instruction Decode (cont.)
  • Example of data hazards
  • For each instruction, the decode phase sets up
    the operation to be executed, the identities of
    the storage elements where the inputs reside, and
    the locations where the results must be placed.
  • Artificial dependences are eliminated through
    register renaming.

11
Instruction Issue and Parallel Execution
  • Run-time checking for availability of data and
    resources.
  • An instruction is ready to execute as soon as its
    input operands are available. However, there are
    other constraints such as execution units and
    register file ports.
  • An issue queue is responsible for holding the
    instructions until their input operands are
    available.
  • Out-of-order execution: the ability to execute
    instructions not in program order, but as soon as
    their operands are ready.
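The issue-queue behavior described above (hold instructions until their operands are available, then issue the ready ones, oldest first, up to the machine width) can be sketched as follows; the instruction format and same-cycle result availability are simplifying assumptions:

```python
# Sketch: oldest-first, readiness-based issue from a queue.

def simulate_issue(instrs, width=2):
    """instrs: list of (name, srcs, dest) in program order.
    A source is ready if an issued instruction produced it, or if
    no instruction in this window writes it. Returns issue groups."""
    produced = {d for _, _, d in instrs}    # regs written in this window
    ready_regs = set()                      # regs already produced
    pending = list(instrs)
    schedule = []
    while pending:
        issued = [i for i in pending
                  if all(s in ready_regs or s not in produced
                         for s in i[1])][:width]
        if not issued:
            break                           # nothing ready (shouldn't happen here)
        schedule.append([i[0] for i in issued])
        for i in issued:
            ready_regs.add(i[2])
            pending.remove(i)
    return schedule

prog = [("load", (), "r1"),
        ("add",  ("r1",), "r2"),   # depends on the load
        ("mul",  (), "r3")]        # independent: issues ahead of program order
print(simulate_issue(prog))        # [['load', 'mul'], ['add']]
```

Note how `mul` issues before `add` even though it is later in program order: issue is driven by operand availability, as the slide states.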

12
Handling Memory Operations
  • For memory operations, the decode phase cannot
    identify the memory locations that will be
    accessed.
  • The determination of the memory location that
    will be accessed requires an address calculation,
    usually integer addition.
  • Once a valid address is obtained, the load or
    store operation is submitted to memory.

13
Committing State
  • The effects of an instruction are allowed to
    modify the logical process state.
  • The purpose of this phase is to implement the
    appearance of a sequential execution model, even
    though the actual execution is not sequential.
  • Machine state is separated into physical and
    logical. Physical state is updated as the
    operations complete while logical is updated in
    sequential program order.
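The physical/logical split above can be sketched with a toy reorder buffer: results complete out of order into the physical state, but the logical state is updated only from the head, in program order. A minimal illustration, not any real machine's commit logic:

```python
# Sketch: out-of-order completion, in-order commit.

class ReorderBuffer:
    def __init__(self, instr_names):
        self.order = list(instr_names)   # program order
        self.done = {}                   # name -> result (physical state)
        self.logical = []                # committed names (logical state)

    def complete(self, name, result):
        self.done[name] = result         # may happen in any order

    def commit(self):
        # retire from the head only while the head has completed
        while self.order and self.order[0] in self.done:
            self.logical.append(self.order.pop(0))

rob = ReorderBuffer(["i1", "i2", "i3"])
rob.complete("i3", 7)     # i3 finishes first...
rob.commit()
print(rob.logical)        # [] : i1 not done, nothing commits yet
rob.complete("i1", 1)
rob.complete("i2", 2)
rob.commit()
print(rob.logical)        # ['i1', 'i2', 'i3'] : in program order
```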

14
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

15
Alpha 21264 (EV6)
16
Instruction Fetch
  • Fetches 4 instructions per cycle
  • Large 64 KB 2-way associative instruction cache
  • Branch predictor dynamically chooses between
    local and global history
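The dynamic choice between local and global history can be sketched as a tournament chooser: a table of 2-bit counters that learns, per branch, which predictor to trust. This is a simplified illustration (indexed here by PC for brevity; the table size and update policy are assumptions, not the 21264's exact design):

```python
# Sketch: a 2-bit chooser between a local and a global prediction.
# Counters 0-1 favor the local predictor, 2-3 favor the global one.

class TournamentChooser:
    def __init__(self, entries=1024):
        self.counters = [1] * entries

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)

    def choose(self, pc, local_pred, global_pred):
        return global_pred if self.counters[self._index(pc)] >= 2 else local_pred

    def update(self, pc, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return                       # train only on disagreement
        i = self._index(pc)
        if global_pred == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

c = TournamentChooser()
pc = 0x800
# Local says not-taken, global says taken; the branch is taken twice,
# so the chooser learns to trust the global predictor for this branch.
for _ in range(2):
    c.update(pc, False, True, True)
```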

17
Register Renaming
  • Assignment of a unique storage location to each
    write reference to a register.
  • The register allocated becomes part of the
    architectural state only when the instruction
    commits.
  • Elimination of WAW and WAR dependences but
    preservation of RAW dependences necessary for
    correct computation.
  • In addition to the 64 architectural registers (32
    integer, 32 floating-point), 41 integer and 41
    floating-point registers are available to hold
    speculative results prior to instruction
    retirement, within an 80-instruction in-flight
    window.
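The renaming rule above (a fresh physical register for every write, with sources read through the current map) can be sketched with a simple map table. Register names and counts are illustrative, not the 21264's; free-list recycling at retirement is omitted:

```python
# Sketch: map-table register renaming. Every write gets a fresh
# physical register (removing WAW/WAR conflicts); every read goes
# through the map (preserving RAW dependences).

def rename(instrs, num_phys=8):
    """instrs: list of (dest, srcs) over architectural registers.
    Returns the same list over physical registers p0, p1, ..."""
    free = [f"p{i}" for i in range(num_phys)]
    table = {}                       # arch reg -> current physical reg
    out = []
    for dest, srcs in instrs:
        rsrcs = tuple(table.get(s, s) for s in srcs)  # RAW: read the map
        preg = free.pop(0)           # fresh register for every write
        table[dest] = preg
        out.append((preg, rsrcs))
    return out

# Two writes to r1 (a WAW hazard) get distinct physical registers,
# and the reader of r1 is linked to the correct producer.
prog = [("r1", ("r2",)), ("r3", ("r1",)), ("r1", ("r4",))]
renamed = rename(prog)
print(renamed)
```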

18
Issue Queues
  • 20 entry integer queue
  • can issue 4 instructions per cycle
  • 15 entry floating-point queue
  • can issue 2 instructions per cycle
  • Each queue keeps a list of pending instructions
    and, every cycle, selects instructions to issue as
    their input data become ready.
  • Queues issue instructions speculatively and older
    instructions are given priority over newer in the
    queue.
  • An issue queue entry becomes available when the
    instruction issues or is squashed due to
    mis-speculation.

19
Execution Engine
  • All execution units require access to the
    register file.
  • The register file is split into two clusters that
    contain duplicates of the 80-entry register file.
  • Two pipes access a single register file to form a
    cluster and the two clusters are combined to
    support 4-way integer execution.
  • Two floating point execution pipes are organized
    in a single cluster with a single 72-entry
    register file.

20
Memory System
  • Supports in-flight memory references and
    out-of-order operation
  • Receives up to 2 memory operations from the
    integer execution pipes every cycle
  • 64 KB 2-way set associative data cache and direct
    mapped level-two cache (ranges from 1 to 16 MB)
  • 3-cycle latency for integer loads and 4 cycles
    for FP loads

21
Store/Load Memory Ordering
  • Memory system supports capabilities of
    out-of-order execution but maintains an in-order
    architectural memory model.
  • It would be wrong if a later load issued prior to
    an earlier store to the same address.
  • This RAW memory dependence cannot be handled by
    the rename logic, because the memory address is
    not known before the instruction issues.
  • If a load is incorrectly issued before an earlier
    store to the same address, the 21264 trains the
    out-of-order execution core to avoid this on
    subsequent executions of the same load: it sets a
    bit in a load wait table that forces the issue
    point of that load to be delayed until all prior
    stores have issued.
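The load wait table can be sketched as a PC-indexed table of bits; the table size, indexing, and reset policy here are illustrative assumptions:

```python
# Sketch: training a load to wait after an ordering violation.

class LoadWaitTable:
    def __init__(self, entries=1024):
        self.bits = [False] * entries

    def _index(self, pc):
        return (pc >> 2) % len(self.bits)

    def must_wait(self, pc):
        """If set, the load waits until all prior stores have issued."""
        return self.bits[self._index(pc)]

    def train(self, pc):
        """Called when this load issued before an older store
        to the same address."""
        self.bits[self._index(pc)] = True

lwt = LoadWaitTable()
load_pc = 0x1200
# First execution: no history, the load issues early and violates ordering.
print(lwt.must_wait(load_pc))   # False
lwt.train(load_pc)
# Subsequent executions: the load is held back.
print(lwt.must_wait(load_pc))   # True
```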

22
Load Hit/ Miss Prediction
  • To achieve the 3-cycle integer load hit latency,
    it is necessary to speculatively issue consumers
    of integer load data before knowing if the load
    hit or missed in the data cache.
  • If the load eventually misses, two integer cycles
    are squashed and all integer instructions that
    issued during those cycles are pulled back in the
    issue queue to be re-issued later.
  • The 21264 predicts when loads will miss and does
    not speculatively issue the consumers of the load
    in that case. The effective load latency is 5
    cycles for an integer load hit that is incorrectly
    predicted to miss.
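The trade-off above can be summarized in one function. The 3- and 5-cycle figures come from the slides; the miss latency and replay penalty below are illustrative assumptions, not 21264 figures:

```python
# Sketch: cycles until a load's consumers can issue with valid data,
# for each combination of predicted and actual hit/miss.

def consumer_issue_latency(predicted_hit, actual_hit, miss_latency=13):
    """miss_latency is an assumed L2 fill time, for illustration only."""
    if predicted_hit and actual_hit:
        return 3                    # correct hit prediction: 3-cycle load-use
    if predicted_hit and not actual_hit:
        return miss_latency + 2     # two issue cycles squashed and replayed
    if actual_hit:
        return 5                    # hit mispredicted as miss (per the slide)
    return miss_latency             # correctly predicted miss

print(consumer_issue_latency(True, True))    # 3
print(consumer_issue_latency(False, True))   # 5
```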

23
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

24
Sim-alpha simulator
  • Sim-alpha is a simulator that models the Alpha
    21264.
  • It models the implementation constraints and
    low-level features of the 21264.
  • It allows the user to vary different parameters of
    the processor, such as the fetch width, reorder
    buffer size, and issue queue sizes.

25
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

26
Out-Of-Order Execution
  • Why out-of-order execution?
  • - In-order processors stall frequently
  • - The pipeline may not stay full because of the
      frequent stalls
  • Example

In out-of-order processors, an instruction with no
outstanding dependences is moved to execution: the
instructions that are ready are allowed to proceed.
27
Out-Of-Order Execution
  • Theme: stall only on RAW hazards and structural
    hazards
  • - RAW hazard
  • - Example:
      LD  R4, 10(R5)
      ADD R6, R4, R8
  • - Structural hazard
  • - Occurs because of resource conflicts
  • - Example: a CPU designed with a single interface
      to memory
  • - This interface is always used during IF
  • - It is also used in MEM for LOAD or STORE
      operations
  • - When a load or store reaches the MEM stage, IF
      must stall

28
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

29
Prediction Speculation
  • Problem: serialization due to dependences
  • - Wait until the producing instruction executes
  • - But we need optimal utilization of resources
  • Solution:
  • - Predict the unknown information
  • - Allow the processor to proceed, assuming the
      predicted information is correct
  • - If the prediction is correct, proceed normally
  • - If not, squash the speculatively executed
      instructions, restore the state, and restart in
      the correct direction

30
Taxonomy Of Speculation
A branch outcome has two possibilities - worst case
50% accuracy. A 32-bit register value has 2^32
possibilities - worst case?
31
Prediction Speculation
  • Control speculation
  • - Current branch predictors are highly accurate
  • - Implemented in commercial processors
  • Data speculation
  • - A lot of research (value profiling, value
      prediction, etc.)
  • - Not implemented in current processors
  • - So far, very little effect on current designs