Title: The anatomy of a modern superscalar processor
1The anatomy of a modern superscalar processor
- Constantinos Kourouyiannis
- Madhava Rao Andagunda
2Outline
- Introduction
- Microarchitecture
- Alpha 21264 processor
- Sim-alpha simulator
- Out of order execution
- Prediction-Speculation
3Introduction
- Superscalar processing is the ability to initiate
multiple instructions during the same cycle. - It aims at producing ever faster microprocessors.
- A typical superscalar processor fetches and
decodes several instructions at a time.
Instructions are executed in parallel based on
the availability of operand data rather than
their original program sequence. Upon completion
instructions are re-sequenced so that they can be
used to update the process state in the correct
program order.
4Outline
- Introduction
- Microarchitecture
- Alpha 21264 processor
- Sim-alpha simulator
- Out of order execution
- Prediction-Speculation
5Microarchitecture
- Instruction Fetch and Branch Prediction
- Decode and Register Dependence Analysis
- Issue and Execution
- Memory Operation Analysis and Execution
- Instruction Reorder and Commit
6Organization of a superscalar processor
7Instruction Fetch and Branch Prediction
- The fetch phase supplies instructions to the rest
of the processing pipeline. - An instruction cache is used to reduce the
latency and increase the bandwidth of instruction
fetch process. - PC searches the cache contents to determine if
the instruction being addressed is present in one
of the cache lines. - In a superscalar implementation, the fetch phase
fetches multiple instructions per cycle from
cache memory.
8Branch Instructions
- Recognizing conditional branches
- Decode information (extra bits) is held in the
instruction cache with every instruction - Determining the branch outcome
- Branch prediction using information regarding
past history of branch outcomes. - Computing the branch target
- Usually integer addition (PC offset value)
- Branch Target Buffer holds target address used
last time the branch was executed - Transferring control
- If branch taken ? at least one clock cycle delay
to recognize branch, modify PC and fetch
instructions from target address
9Instruction Decode
- Instructions are removed from fetch buffers,
examined and data dependence linkages are set up. - Data dependences
- True dependences can cause a read after write
(RAW) hazard - Artificial dependences can cause write after
read (WAR) and write after write (WAW) hazards. - Hazards
- RAW occurs when a consuming instruction reads a
value before the producing instruction writes it. - WAR occurs when an instruction writes a value
before a preceding instruction reads it. - WAW occurs when multiple instructions update the
same storage location but not in the proper
order.
10Instruction Decode (cont.)
- Example of data hazards
- For each instruction, decode phase sets up the
operation to be executed, the identities of
storage elements where input reside and the
locations where result must be placed. - Artificial dependences are eliminated through
register renaming.
11Instruction Issue and Parallel Execution
- Run-time checking for availability of data and
resources. - An instruction is ready to execute as soon as its
input operands are available. However, there are
other constraints such as execution units and
register file ports. - An issue queue is responsible for holding the
instructions until their input operands are
available. - Out-of-order execution the ability of executing
instructions not in the program order but as soon
as their operands are ready.
12Handling Memory Operations
- For memory operations, the decode phase cannot
identify the memory locations that will be
accessed. - The determination of the memory location that
will be accessed requires an address calculation,
usually integer addition. - Once a valid address is obtained, the load or
store operation is submitted to memory.
13Committing State
- The effects of an instruction are allowed to
modify the logical process state. - The purpose of this phase is to implement the
appearance of a sequential execution model, even
though the actual execution is not sequential. - Machine state is separated into physical and
logical. Physical state is updated as the
operations complete while logical is updated in
sequential program order.
14Outline
- Introduction
- Microarchitecture
- Alpha 21264 processor
- Sim-alpha simulator
- Out of order execution
- Prediction-Speculation
15Alpha 21264 (EV6)
16Instruction Fetch
- Fetches 4 instructions per cycle
- Large 64 KB 2-way associative instruction cache
- Branch predictor dynamically chooses between
local and global history
17Register Renaming
- Assignment of a unique storage location with each
write-reference to a register. - The register allocated becomes part of the
architectural state only when the instruction
commits. - Elimination of WAW and WAR dependences but
preservation of RAW dependences necessary for
correct computation. - 64 architectural 41 integer 41 floating point
registers available to hold speculative results
prior to instruction retirement in an 80
instruction in-flight window.
18Issue Queues
- 20 entry integer queue
- can issue 4 instructions per cycle
- 15 entry floating-point queue
- can issue 2 instructions per cycle
- A list of pending instructions is kept and each
cycle these queues select from these instructions
as their input data are ready. - Queues issue instructions speculatively and older
instructions are given priority over newer in the
queue. - An issue queue entry becomes available when the
instruction issues or is squashed due to
mis-speculation.
19Execution Engine
- All execution units require access to the
register file. - The register file is split into two clusters that
contain duplicates of the 80-entry register file. - Two pipes access a single register file to form a
cluster and the two clusters are combined to
support 4-way integer execution. - Two floating point execution pipes are organized
in a single cluster with a single 72-entry
register file.
20Memory System
- Supports in-flight memory references and
out-of-order operation - Receives up to 2 memory operations from the
integer execution pipes every cycle - 64 KB 2-way set associative data cache and direct
mapped level-two cache (ranges from 1 to 16 MB) - 3-cycle latency for integer loads and 4 cycles
for FP loads
21Store/Load Memory Ordering
- Memory system supports capabilities of
out-of-order execution but maintains an in-order
architectural memory model. - It would be wrong if a later load issued prior to
an earlier store to the same address. - This RAW memory dependency cannot be handled by
rename logic because it doesnt know the memory
address before instruction issue. - If a load is incorrectly issued before an earlier
store to the same address, the 21264 trains the
out-of-order execution core to avoid it on
subsequent executions of the same load. It sets a
bit on a load wait table, that forces the issue
point of the load to be delayed until all prior
stores have issued.
22Load Hit/ Miss Prediction
- To achieve the 3-cycle integer load hit latency,
it is necessary to speculatively issue consumers
of integer load data before knowing if the load
hit or missed in the data cache. - If the load eventually misses, two integer cycles
are squashed and all integer instructions that
issued during those cycles are pulled back in the
issue queue to be re-issued later. - The 21264 predicts when loads will miss and does
not speculatively issue the consumers of the load
in that case. Effective load latency 5 cycles
for an integer load hit that is incorrectly
predicted to miss.
23Outline
- Introduction
- Microarchitecture
- Alpha 21264 processor
- Sim-alpha simulator
- Out of order execution
- Prediction-Speculation
24Sim-alpha simulator
- Sim-alpha is a simulator that models Alpha 21264.
- It models the implementation constraints and
low-level features in the 21264. - Allows user to vary the different parameters of
the processor, such as fetch width, reorder
buffer size and issue queue sizes.
25Outline
- Introduction
- Microarchitecture
- Alpha 21264 processor
- Sim-alpha simulator
- Out of order execution
- Prediction-Speculation
26Out-Of-Order Execution
- Why Out-Of-Order Execution?
- - In-order Processors Stalls
- Pipeline may not be full because
of the frequent stalls - Example
In the Out-Of-Order Processors - No
dependency Move the Instruction for execution
- Means Allow the Instructions that are Ready
27Out-Of-Order Execution
- Theme Stall Only on RAW hazards Structural
Hazards - - RAW Hazard
- Example LD R4 10(R5)
- ADD R6 R4 R8
- - Structural Hazard
- - Occurs because of Resource Conflicts
- - Example If the CPU is designed with
a single Interface to Memory - This Interface is
always used during IF - Also used in MEM for
LOAD or STORE Operations. - When a Load or Store
gets into MEM stage IF must stall
28Outline
- Introduction
- Microarchitecture
- Alpha 21264 processor
- Sim-alpha simulator
- Out of order execution
- Prediction-Speculation
29Prediction Speculation
- Problem Serialization due to Dependences
- - Wait Until Producing Instruction Execute
- - But We need Optimal Utilization of Resources
- Solution
- - Predict Unknown information
- - Allow the processor to proceed
- (Assuming predicted information is correct)
- - If the Prediction is Correct Proceed
Normally - - If not squash the speculatively executed
instructions - - Restore the status
- - Start In the correct Direction
30Taxonomy Of Speculation
Two Possibilities -Worst case 50 accuracy
For a 32 bit Register Possibilities - Worst
case ?
31Prediction Speculation
- Control Speculation
- - Current Branch Predictors are highly
Accurate - - Implemented in Commercial
Processors - Data Speculation
- - Lot of research (Value Profiling,
Value prediction etc) - - Not Implemented in the Current
Processors - - Till Now Very Little Effects on
Current Designs -