Title: Superscalar Processors
1. Superscalar Processors
2. Overview
- What are superscalar processors?
- Program representation, dependencies, and parallel execution
- Microarchitecture of a typical superscalar processor
- A look at 3 superscalar implementations
- Conclusion: the future of superscalar processing
3. What are superscalars and how do they differ from pipelines?
- In simple pipelining, you are limited to fetching a single instruction into the pipeline per clock cycle. This causes a performance bottleneck.
- Superscalar processors overcome the 1-instruction-per-clock-cycle limit of simple pipelines and can fetch multiple instructions during the same clock cycle. They also employ advanced techniques like branch prediction to ensure an uninterrupted stream of instructions.
4. Development History of Superscalars
- Pipelining was developed in the late 1950s and became popular in the 1960s.
- Examples of early pipelined architectures are the CDC 6600 and the IBM 360/91 (Tomasulo's algorithm).
- Superscalars appeared in the mid to late 1980s.
5. Instruction Processing Model
- Need to maintain software compatibility.
- The assembly instruction set was the level chosen for compatibility because it does not affect existing software.
- Need to maintain at least a semblance of a sequential execution model for programmers, who rely on the concept of sequential execution in software design.
- A superscalar processor may execute instructions out of order at the hardware level, but execution must appear sequential at the programming level.
6. Superscalar Implementation
- Instruction fetch strategies that simultaneously fetch multiple instructions, often by using branch prediction techniques.
- Methods for determining data dependencies and keeping track of register values during execution.
- Methods for issuing multiple instructions in parallel.
- Resources for parallel execution of many instructions, including multiple pipelined functional units and memory hierarchies capable of simultaneously servicing multiple memory references.
- Methods for communicating data values through memory via load and store instructions.
- Methods for committing the process state in correct order, to maintain the outward appearance of sequential execution.
7. From Sequential to Parallel
- Parallel execution often results in instructions completing non-sequentially.
- Speculative execution means that some instructions may be executed when they would not have been executed at all according to the sequential model (e.g. after an incorrect branch prediction).
- To maintain the outward appearance of sequential execution for the programmer, storage cannot be updated immediately. The results must be held in temporary status until the storage is updated. Meanwhile, these temporary results must be usable by dependent instructions.
- When it is determined that the sequential model would have executed an instruction, the temporary results are made permanent by updating the outward state of the machine. This process is called committing the instruction.
8. Dependencies
- Parallel execution introduces 2 types of dependencies:
- Control dependencies, due to incrementing or updating the program counter in response to conditional branch instructions.
- Data dependencies, due to resource contention as instructions may need to read/write the same storage or memory locations.
9. Overcoming Control Dependencies: Example

    Block 1:
      L2: move r3,r7
          lw   r8,(r3)
          add  r3,r3,4
          lw   r9,(r3)
          ble  r8,r9,L3
    Block 2:
          move r3,r7
          sw   r9,(r3)
          add  r3,r3,4
          sw   r8,(r3)
          add  r5,r5,1
    Block 3:
      L3: add  r6,r6,1
          add  r7,r7,4
          blt  r6,r4,L2

- Blocks are initiated into the window of execution.
10. Control Dependencies: Branch Prediction
- To gain the most parallelism, control dependencies due to conditional branches have to be overcome.
- Branch prediction attempts to overcome this by predicting the outcome of a branch and speculatively fetching and executing instructions from the predicted path.
- If the predicted path is correct, the speculative status of the instructions is removed and they affect the state of the machine like any other instruction.
- If the predicted path is wrong, recovery actions are taken so as not to incorrectly modify the state of the machine.
11. Data Dependencies
- Data dependencies occur because instructions may access the same register or memory location.
- 3 types of data dependencies, or hazards:
- RAW (read after write): occurs because a later instruction can only read a value after a previous instruction has written it.
- WAR (write after read): occurs when an instruction needs to write a new value into a storage location but must wait until all preceding instructions needing to read the old value have done so.
- WAW (write after write): occurs when multiple instructions update the same storage location; it must appear that these updates occur in the proper sequence.
12. Data Dependency Example

    move r3,r7      ; writes r3
    lw   r8,(r3)    ; RAW on r3: must read the value move wrote
    add  r3,r3,4    ; WAW on r3 with move; WAR on r3 with the lw above
    lw   r9,(r3)    ; RAW on r3: must read the value add wrote
    ble  r8,r9,L3
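- The three hazard types can be checked mechanically from each instruction's register read/write sets. Below is a minimal sketch in Python (the Insn structure and helper names are illustrative, not from the paper):

    # Classify the data hazards that order a later instruction
    # after an earlier one, given register read/write sets.
    from dataclasses import dataclass, field

    @dataclass
    class Insn:
        text: str
        reads: set = field(default_factory=set)
        writes: set = field(default_factory=set)

    def hazards(earlier, later):
        found = set()
        if earlier.writes & later.reads:
            found.add("RAW")  # later reads a value earlier produces
        if earlier.reads & later.writes:
            found.add("WAR")  # later overwrites a value earlier still reads
        if earlier.writes & later.writes:
            found.add("WAW")  # both write: updates must appear in order
        return found

    move = Insn("move r3,r7", reads={"r7"}, writes={"r3"})
    lw   = Insn("lw r8,(r3)", reads={"r3"}, writes={"r8"})
    add  = Insn("add r3,r3,4", reads={"r3"}, writes={"r3"})

    print(hazards(move, lw))   # {'RAW'}
    print(hazards(move, add))  # RAW and WAW (add both reads and writes r3)
    print(hazards(lw, add))    # {'WAR'}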
13. Parallel Execution Method
- Instructions are fetched using branch prediction to form a dynamic stream of instructions.
- Instructions are examined for dependencies and dependencies are removed.
- Examined instructions are dispatched to the window of execution. (These instructions are no longer in sequential order, but are ordered according to their data dependencies.)
- Instructions are issued from the window in an order determined by their dependencies and hardware resource availability.
- Following execution, instructions are put back into their sequential program order and then committed so their results update the machine state.
14. Superscalar Microarchitecture
- The parallel execution method can be summarized in 5 phases (a skeleton of how they chain together is sketched after this list):
- 1. Instruction fetch and branch prediction
- 2. Decode and register dependence analysis
- 3. Issue and execution
- 4. Memory operation analysis and execution
- 5. Instruction reorder and commit
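- A minimal Python skeleton of the five phases as one cycle loop; every method name here is a placeholder for the machinery described on the following slides, not part of any real design:

    # Skeleton: the five phases chained together per cycle.
    def run(processor):
        while not processor.halted:
            processor.fetch_and_predict()       # 1. fetch + branch prediction
            processor.decode_rename_dispatch()  # 2. decode + dependence analysis
            processor.issue_and_execute()       # 3. issue to functional units
            processor.do_memory_ops()           # 4. loads/stores, hazard checks
            processor.reorder_and_commit()      # 5. retire in program order
            processor.cycle += 1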
15. Superscalar Microarchitecture
16. Instruction Fetch and Branch Prediction
- The fetch phase must fetch multiple instructions per cycle from cache memory to keep a steady feed of instructions going to the other stages.
- The number of instructions fetched per cycle should match or exceed the peak instruction decode/execution rate (to allow for cache misses or occasions where the maximum number of instructions can't be fetched).
- For conditional branches, the fetch mechanism must be redirected to fetch instructions from branch targets.
- 4 steps to processing conditional branch instructions:
- 1. Recognizing that an instruction is a conditional branch
- 2. Determining the branch outcome (taken or not taken)
- 3. Computing the branch target
- 4. Transferring control by redirecting instruction fetch (as in the case of a taken branch)
17. Processing Conditional Branches
- STEP 1: Recognizing Conditional Branches
- Instruction decode information is held in the instruction cache. These extra bits are used to identify the basic instruction types, as in the sketch below.
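- A minimal sketch of the predecode idea in Python; the opcode set and fields are illustrative, not any real ISA encoding:

    # Attach predecode bits when an instruction is filled
    # into the instruction cache.
    BRANCH_OPCODES = {"beq", "bne", "ble", "blt"}

    def predecode(insn_text):
        """Type bits stored next to the instruction in the I-cache."""
        opcode = insn_text.split()[0]
        return {"is_branch": opcode in BRANCH_OPCODES}

    cache_line = ["lw r9,(r3)", "ble r8,r9,L3"]
    fill = [(insn, predecode(insn)) for insn in cache_line]
    # The fetch stage can now spot the conditional branch
    # without performing a full decode.
    print(fill)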
18. Processing Conditional Branches
- STEP 2: Determining Branch Outcome
- Static predictions (information determined from the static binary). Ex: certain opcode types might result in more taken branches than others, or a backwards branch direction might be more likely in loops.
- Predictions based on profiling information (execution statistics collected during a previous run of the program).
- Dynamic predictions (information gathered during program execution about the past history of branch outcomes). Branch outcomes are stored in a branch history table or a branch prediction table; a common scheme is sketched below.
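- One widely used dynamic scheme (not the only one) is a table of 2-bit saturating counters indexed by the branch address. A minimal sketch, with arbitrary table size and indexing:

    # Branch prediction table of 2-bit saturating counters.
    # Counter values 0..3; 2 or 3 means "predict taken".
    TABLE_SIZE = 512
    table = [1] * TABLE_SIZE          # start weakly not-taken

    def predict(pc):
        return table[pc % TABLE_SIZE] >= 2

    def update(pc, taken):
        i = pc % TABLE_SIZE
        if taken:
            table[i] = min(table[i] + 1, 3)  # saturate at strongly taken
        else:
            table[i] = max(table[i] - 1, 0)  # saturate at strongly not-taken

    # A loop-closing branch: the predictor quickly settles on "taken"
    # and mispredicts only at loop exit.
    for outcome in [True, True, True, True, False]:
        print(predict(0x40), outcome)
        update(0x40, outcome)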
19. Processing Conditional Branches
- STEP 3: Computing Branch Targets
- Branch targets are usually relative to the program counter and are computed as branch target = program counter + offset.
- Finding target addresses can be sped up by having a branch target buffer, which holds the target address used the last time the branch was executed (sketched below).
- Ex: the Branch Target Address Cache used in the PowerPC 604.
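- The core of a branch target buffer is a small map from branch address to last-seen target. A minimal sketch in Python (real BTBs are set-associative hardware tables; this is only the idea):

    # Branch target buffer keyed by the branch's PC.
    btb = {}

    def record_target(pc, target):
        btb[pc] = target

    def lookup_target(pc):
        """Target used the last time this branch executed, if known."""
        return btb.get(pc)

    record_target(0x1000, 0x2040)       # branch at 0x1000 jumped to 0x2040
    print(hex(lookup_target(0x1000)))   # fetch can redirect immediately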
20. Processing Conditional Branches
- STEP 4: Transferring Control
- Problem: there is often a delay in recognizing a branch, modifying the program counter, and fetching the target instructions.
- Several solutions:
- Use the stockpiled instructions in the instruction buffer to mask the delay.
- Use a buffer that contains instructions from both the taken and not-taken branch paths.
- Delayed branches: the branch does not take effect until the instruction after the branch. This allows the fetch of target instructions to overlap execution of the instruction following the branch. Delayed branches also introduce assumptions about pipeline structure, and are therefore rarely used anymore.
21. Instruction Decoding, Renaming, and Dispatch
- Instructions are removed from the fetch buffers, decoded, and examined for control and data dependencies.
- Instructions are dispatched to buffers associated with hardware functional units for later issuing and execution.
22. Instruction Decoding
- The decode phase sets up execution tuples for each instruction.
- An execution tuple contains (see the sketch after this list):
- An operation to be executed
- The identities of the storage elements where input operands will eventually reside
- The locations where an instruction's result must be placed
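- A minimal sketch of such a tuple in Python; the field names and the renamed register names are illustrative:

    # An execution tuple produced by the decode phase.
    from dataclasses import dataclass

    @dataclass
    class ExecutionTuple:
        op: str      # the operation to be executed
        src: tuple   # where the input operands will eventually reside
        dest: str    # where the result must be placed

    # Decoding "add r3,r3,4" after renaming: the old r3 lives in p7,
    # the new r3 has been assigned p12.
    t = ExecutionTuple(op="add", src=("p7", "#4"), dest="p12")
    print(t)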
23. Register Renaming
- Used to eliminate WAR and WAW dependencies.
- 2 types (the first is sketched after this list):
- Physical register file: the physical register file is larger than the logical register file, and a mapping table is used to associate physical registers with logical registers. Physical registers are assigned from a free list.
- Reorder buffer: uses the same size physical and logical register files. There is also a reorder buffer that contains 1 entry per active instruction and maintains the sequential ordering of instructions. It is a circular queue implemented in hardware. As instructions are dispatched, they enter the queue at the tail. As instructions complete, their results are inserted into their assigned locations in the reorder buffer. When an instruction reaches the head of the queue, its entry is removed and its result is placed in the register file.
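- A minimal sketch of the first method (map table plus free list) in Python; register names and sizes are arbitrary:

    # Register renaming with a mapping table and a free list:
    # 8 logical registers mapped onto 16 physical registers.
    free_list = [f"p{i}" for i in range(8, 16)]
    map_table = {f"r{i}": f"p{i}" for i in range(8)}  # initial identity map

    def rename(dest, srcs):
        """Rename one instruction: look up sources in the map,
        then give the destination a fresh physical register."""
        phys_srcs = [map_table.get(s, s) for s in srcs]  # literals pass through
        phys_dest = free_list.pop(0)
        map_table[dest] = phys_dest
        return phys_dest, phys_srcs

    # "move r3,r7" then "add r3,r3,4": the second write to r3 gets a
    # new physical register, so the WAW (and WAR) on r3 disappears.
    print(rename("r3", ["r7"]))        # ('p8', ['p7'])
    print(rename("r3", ["r3", "#4"]))  # ('p9', ['p8', '#4'])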
24. Register Renaming I
25. Register Renaming II (using a reorder buffer)
26. Instruction Issuing and Parallel Execution
- Instruction issuing is defined as the run-time checking for availability of data and resources.
- Constraints on instruction issue:
- Availability of physical resources like functional units, interconnect, and the register file
- Organization of the buffers holding execution tuples
27. Single Queue Method
- If there is no out-of-order issuing, operand availability can be managed via reservation bits assigned to each register (sketched below).
- A register is reserved when an instruction that modifies it issues.
- The reservation is cleared when the instruction completes.
- An instruction may issue if there are no reservations on its operands.
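- A minimal sketch of the reservation bits in Python; the structure names are illustrative:

    # One reservation per register with a pending write.
    reserved = set()

    def can_issue(srcs, dest):
        """In-order issue: stall if any operand (or the destination,
        to keep writes ordered) is still reserved."""
        return not (set(srcs) | {dest}) & reserved

    def issue(srcs, dest):
        reserved.add(dest)       # reserve the destination while in flight

    def complete(dest):
        reserved.discard(dest)   # clear the reservation on writeback

    issue(["r7"], "r3")
    print(can_issue(["r3"], "r8"))  # False: r3 still reserved (RAW)
    complete("r3")
    print(can_issue(["r3"], "r8"))  # True once the write completes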
28. Multiple Queue Method
- There are multiple queues organized according to instruction type.
- Instructions issue from individual queues in sequential order.
- Individual queues may issue out of order with respect to one another.
29. Reservation Stations
- Instructions issue out of order.
- Reservation stations hold information about the source operands for an operation.
- When all operands are present, the instruction may issue (sketched below).
- Reservation stations may be partitioned according to instruction type or pooled into a single large block.
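- A minimal sketch of a pooled set of reservation stations in Python; the entry fields and the result "broadcast" are illustrative:

    # An entry waits in a station until all operands have arrived.
    stations = []

    def dispatch(op, operands):
        """operands: maps operand name -> value, or None if still pending."""
        stations.append({"op": op, "operands": dict(operands)})

    def broadcast(name, value):
        """A completing instruction forwards its result to waiting entries."""
        for entry in stations:
            if entry["operands"].get(name, 0) is None:
                entry["operands"][name] = value

    def ready():
        return [e for e in stations
                if all(v is not None for v in e["operands"].values())]

    dispatch("add", {"p8": None, "#4": 4})  # waiting on p8
    print(ready())                          # []: operand not yet present
    broadcast("p8", 100)
    print(ready())                          # the add may now issue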
30. Memory Operation Analysis and Execution
- To reduce latency, memory hierarchies are used that may contain primary and secondary caches.
- Address translation to physical addresses is sped up by using a translation lookaside buffer, which holds a cache of recently used address translations.
- A multiported memory hierarchy is used to allow multiple memory requests to be serviced simultaneously. Multiporting is achieved by having multiple memory banks or by making multiple serial requests during the same cycle.
- Store address buffers are used to make sure memory operations don't violate hazard conditions. Store address buffers contain the addresses of all pending store operations (sketched below).
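- A minimal sketch of the hazard check against a store address buffer in Python; the structures are illustrative:

    # Addresses of all pending (dispatched but uncommitted) stores.
    pending_store_addrs = []

    def store_dispatched(addr):
        pending_store_addrs.append(addr)

    def store_committed(addr):
        pending_store_addrs.remove(addr)

    def load_may_proceed(addr):
        """A load must wait if an earlier, still-pending store
        targets the same address (a memory RAW hazard)."""
        return addr not in pending_store_addrs

    store_dispatched(0x1000)
    print(load_may_proceed(0x1000))  # False: conflicting pending store
    print(load_may_proceed(0x2000))  # True: no conflict
    store_committed(0x1000)
    print(load_may_proceed(0x1000))  # True once the store completes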
31. Memory Hazard Detection
32. Instruction Reorder and Commit
- When an instruction is committed, its result is allowed to modify the logical state of the machine.
- The purpose of the commit phase is to maintain the illusion of a sequential execution model.
- 2 methods (the second is sketched after this list):
- 1. The state of the machine is saved in a history buffer. Instructions update the state of the machine as they execute, and when there is a problem, the state of the machine can be recovered from the history buffer. The commit phase gets rid of the history state that is no longer needed.
- 2. The state of the machine is separated into a physical state and a logical state. The physical state is updated immediately as instructions complete. The logical state is updated in sequential order as the speculative status of instructions is cleared. The speculative state is maintained in a reorder buffer, and during the commit phase the result of an operation is moved from the reorder buffer to a logical register or memory.
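- A minimal sketch of the second method in Python: a reorder buffer as a FIFO whose head entries retire, in program order, only once complete. Field names are illustrative:

    # Results update the logical state only when an entry
    # reaches the head of the queue and is complete.
    from collections import deque

    rob = deque()
    logical_regs = {}

    def dispatch(dest):
        entry = {"dest": dest, "value": None, "done": False}
        rob.append(entry)       # enter at the tail, in program order
        return entry

    def finish(entry, value):
        entry["value"] = value  # held in the buffer, not yet visible
        entry["done"] = True

    def commit():
        while rob and rob[0]["done"]:
            e = rob.popleft()   # head of the queue: oldest instruction
            logical_regs[e["dest"]] = e["value"]

    a = dispatch("r3")
    b = dispatch("r8")
    finish(b, 42)               # completes out of order...
    commit()
    print(logical_regs)         # {}: r8 may not commit before r3
    finish(a, 7)
    commit()
    print(logical_regs)         # {'r3': 7, 'r8': 42}, in program order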
33. The Role of Software
- Superscalars can be made more efficient if parallelism in software can be increased:
- 1. By increasing the likelihood that a group of instructions can be issued simultaneously
- 2. By decreasing the likelihood that an instruction has to wait for the result of a previous instruction
34. A Look at 3 Superscalar Processors
- MIPS R10000
- DEC Alpha 21164
- AMD K5
35. MIPS R10000
- A typical superscalar processor
- Able to fetch 4 instructions at a time
- Uses predecode to generate bits that assist with branch prediction (512-entry prediction table)
- A resume cache is used to fetch not-taken instructions and has space to handle 4 branch predictions at a time
- Register renaming uses a physical register file twice the size of the logical register file; physical registers are allocated from a free list
- 3 instruction queues: memory, integer, and floating point
- 5 functional units (an address adder, 2 integer ALUs, a floating point multiplier/divider/square-rooter, and a floating point adder)
- Has an on-chip primary data cache (32 KB, 2-way set associative) and an off-chip secondary cache
- Uses a reorder buffer mechanism to maintain machine state during exceptions
- Instructions are committed 4 at a time
36. Alpha 21164
- A simple superscalar that forgoes the advantages of dynamic scheduling in favor of a high clock rate
- 4 instructions at a time are fetched from an 8K instruction cache
- 2 instruction buffers issue instructions in program order
- Branches are predicted using a history table associated with the instruction cache
- Uses the single queue method of instruction issuing
- 4 functional units (2 ALUs, a floating point adder, and a floating point multiplier)
- 2-level cache memory (primary: 8K cache; secondary: 96K 3-way set associative cache)
- Sequential machine state is maintained during interrupts because instructions are not issued out of order
- The pipeline functions as a simple reorder buffer, since instructions in the pipeline are maintained in sequential order
37. Alpha 21164 Superscalar Organization
38. AMD K5
- Implements the complex Intel x86 instruction set
- Uses 5 predecode bits for decoding variable-length instructions
- Instructions are fetched from the instruction cache at a rate of 16 bytes/cycle and placed in a 16-element queue
- Branch prediction is integrated with the instruction cache; there is 1 prediction entry per cache line
- Due to instruction set complexity, 2 cycles are required to decode
- Instructions are converted to ROPs (simple RISC-like operations)
- Instructions read operand data and are dispatched to functional unit reservation stations
- There are 6 functional units: 2 integer ALUs, 1 floating point unit, 2 load/store units, and a branch unit
- Up to 4 ROPs can be issued per clock cycle
- Has an 8K data cache with 4 banks; dual loads/stores are allowed to different banks
- A 16-entry reorder buffer maintains machine state when there is an exception and recovers from incorrect branch predictions
39. AMD K5 Superscalar Organization
40. The Future of Superscalar Processing
- Superscalar design has delivered real performance gains.
- BUT increasing hardware parallelism may be a case of diminishing returns:
- There are limits to the instruction-level parallelism in programs that can be exploited.
- Simultaneously issuing more instructions increases complexity and requires more cross-checking, which will eventually affect the clock rate.
- There is a widening gap between processor and memory performance.
- Many believe that 8-way superscalar is the limit and that we will reach this limit within 2 years.
- Some believe VLIW will replace superscalars and offers advantages:
- Because software is responsible for creating the execution schedule, the instruction window that can be examined for parallelism is larger than what a superscalar can manage in hardware.
- Since there is no dependence checking by the processor, VLIW hardware is simpler to implement and may allow a faster clock.
41. Reference
- J. E. Smith and G. S. Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, vol. 83, no. 12, Dec. 1995.