1National and Kapodistrian University of Athens
Dep. of Informatics and Telecommunications
MSc in Computer Systems Technology
Advanced Computer Architecture
- The Microarchitecture of Superscalar Processors,
- by J.E. Smith and G.S. Sohi
-
- Giorgos Matrozos
- M 414
- matrozos_at_ceid.upatras.gr
2- An Introduction
- Superscalar processing is the capability of
initiating multiple instructions in the same
clock cycle.
- In the IF phase, the outcomes of conditional
branches are predicted ahead of time so that the
instruction stream is not interrupted.
- Then, data dependences are resolved, and after
that the instructions are distributed to the
functional units.
- Execution begins in parallel, based on the
availability of the operands. Usually, the
sequence of the original program is not followed.
(DEFINITION) Therefore, this is called dynamic
instruction scheduling.
- After execution ends, the instructions are
re-sequenced into the original program order and
their results are committed.
3The microarchitecture of superscalar MPs
4The Instruction Processing Model A dominant
element in designing a computer architecture is
compatibility. In superscalar processors, this
compatibility is binary compatibility: the
ability to execute programs written for older
versions or generations of the architecture. At
some point, it became obvious that instruction
sets should be designed with compatibility in
mind. Until now, the sequential execution model
has been followed; that is, instructions are
executed in the order in which they enter the
processor. This creates the need to define a
precise state: when an interrupt occurs, the
processor saves the state of the memory and the
registers at that point in time.
5- Elements Of High Performance
- To achieve higher performance, we need to
reduce the execution time. The key to superscalar
processors is executing multiple instructions in
parallel.
- (DEFINITION) The time to fetch and execute an
instruction is called latency.
- Superscalar processing comprises
- Fetching strategies for simultaneous fetching
of multiple instructions, and branch prediction
techniques.
- Methods for determining all kinds of
dependences.
- Methods for issuing multiple instructions in
parallel.
- Resources for parallel execution (multiple
pipelined functional units, memory hierarchies).
- Methods for handling data in the memories.
- Methods for committing the state.
- A good superscalar processor faces all the
above as an integrated whole, not separately.
6- Problem Solved by Superscalar MPs (I)
- The sequence of executed instructions forms a
dynamic instruction stream. The first step to
increase ILP is to overcome control dependences.
(DEFINITION) An instruction is said to be control
dependent on its preceding instruction, because
the flow of control must pass through the
preceding one first.
- 1st type: due to an incremented PC
- 2nd type: due to an updated PC (branches, jumps)
- Solution of the 1st type: In the static program
there are blocks. Once a block is fetched into
the IF register, it is known that all of its
instructions will eventually be executed. Any
sequence of instructions in a block can be
initiated into a conceptual window of execution.
Instructions in this window are free to execute
in parallel.
- Solution of the 2nd type: Predict the outcome
and speculatively fetch and execute instructions
from the predicted path. Instructions of the
predicted path enter the WoE. If the prediction
is correct, the speculative status is removed and
the effect on the state is the same as for any
other instruction. If the prediction is
incorrect, the speculative execution was
incorrect and recovery must be initiated.
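The 1st-type solution above can be sketched in code. This is an illustrative toy, not from the paper: the textual instruction mnemonics and the prefix test for control-transfer instructions are assumptions, and real hardware also starts new blocks at branch targets, which this sketch ignores.

```python
def split_basic_blocks(instructions):
    """Group a static instruction list into basic blocks.

    A block ends at a control-transfer instruction; every
    instruction inside a block is guaranteed to execute once the
    block is entered, so the whole block may be initiated into
    the window of execution together.
    """
    blocks, current = [], []
    for instr in instructions:
        current.append(instr)
        # Illustrative test: treat these mnemonics as branches/jumps.
        if instr.startswith(("beq", "bne", "jmp", "call", "ret")):
            blocks.append(current)
            current = []
    if current:                      # trailing fall-through block
        blocks.append(current)
    return blocks

program = ["load r1", "add r2", "beq r2, L1", "sub r3", "jmp L2", "mul r4"]
print(split_basic_blocks(program))
# → [['load r1', 'add r2', 'beq r2, L1'], ['sub r3', 'jmp L2'], ['mul r4']]
```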
7- Problem Solved by Superscalar MPs (II)
- Next, we have the data dependences. They occur
among instructions because the instructions may
read/write the same storage location.
- (DEFINITION) When this happens, a hazard is
said to exist. There are 3 types of hazards: RAW,
WAR, WAW.
- After control and data dependences are
resolved, instructions are issued for execution.
In essence, the H/W creates a parallel execution
schedule in which the order of instructions
differs from that of the sequential program.
Moreover, with speculative execution some
instructions complete execution that would not
have been executed at all if the sequential model
were followed.
- Let us see that in the next picture.
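The three hazard types can be sketched as a small classifier. This is a minimal illustration, assuming each instruction is summarized as a destination register plus a set of source registers (a representation chosen for the sketch, not taken from the paper).

```python
def classify_hazards(first, second):
    """Return the hazard types 'second' has on an earlier 'first'.

    Each instruction is (destination, {sources}).
    RAW: second reads a register that first writes.
    WAR: second writes a register that first reads.
    WAW: both write the same register.
    """
    dst1, srcs1 = first
    dst2, srcs2 = second
    hazards = set()
    if dst1 in srcs2:
        hazards.add("RAW")
    if dst2 in srcs1:
        hazards.add("WAR")
    if dst1 == dst2:
        hazards.add("WAW")
    return hazards

# r3 = r1 + r2  followed by  r4 = r3 * r5
print(classify_hazards(("r3", {"r1", "r2"}), ("r4", {"r3", "r5"})))
# → {'RAW'}
```

Only RAW is a true dependence that forces ordering; WAR and WAW are name dependences, which the renaming described later can remove.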
8(No Transcript)
9- Instruction Fetching and Branch Prediction (I)
- Superscalar MPs have an instruction cache, a
memory containing recently used instructions,
kept to reduce latency. It is organised into
blocks or lines.
- The default method for IF is to increment the
PC by the number of instructions fetched and use
the incremented PC to fetch the next block.
- Processing of conditional branch instructions
can be broken down into
- Recognizing conditional branches: Obvious!!
Some extra bits of decode info are held in the
instruction cache for identifying all types of
instructions.
- Determining the outcome: Some predictors use
static info, e.g. the fact that certain opcode
types result more often in taken branches, or
execution statistics. Other predictors use
dynamic info, like the past history of branch
outcomes. A history (or prediction) table is
used, usually with two bits per entry. These bits
form a saturating counter that is incremented
when the branch is taken and decremented when it
is not taken.
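The two-bit counter scheme described above can be sketched as follows. This is a minimal model, assuming a direct-mapped table indexed by the low-order PC bits; the table size and the weakly-not-taken initial value are assumptions for the sketch.

```python
class TwoBitPredictor:
    def __init__(self, entries=512):
        self.entries = entries
        self.table = [1] * entries   # counters start weakly not-taken

    def _index(self, pc):
        return pc % self.entries     # low-order PC bits select a counter

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # 2 or 3 => predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0

bp = TwoBitPredictor()
for outcome in [True, True, False, True]:   # a mostly-taken branch
    bp.update(0x40, outcome)
print(bp.predict(0x40))   # → True
```

The saturation is the point of the two bits: a single anomalous outcome (e.g. a loop exit) only moves the counter one step, so the steady-state prediction survives it.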
10- Instruction Fetching and Branch Prediction (II)
- Computing branch targets: Usually an integer
addition is required. In most computers, the
targets are relative to the PC and use an offset.
To speed up the process, there is a branch target
buffer that holds the target address that was
used the last time the branch was executed.
- Transferring control: On a predicted-taken
path, there is at least one cycle of delay in
recognizing the branch, modifying the PC and
fetching instructions from the target. This may
result in pipeline bubbles. The solution is to
use the instruction buffer to mask the delay.
Some of the earlier RISC instruction sets used
delayed branches; that is, a branch did not take
effect until after the instruction following it.
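The branch target buffer described above can be sketched as a small tagged table. This is an illustrative model only; the entry count, the dictionary-based storage, and the addresses used are assumptions, not details from the paper.

```python
class BranchTargetBuffer:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}              # index -> (tag, target)

    def lookup(self, pc):
        entry = self.table.get(pc % self.entries)
        if entry and entry[0] == pc:
            return entry[1]          # hit: fetch from this target next cycle
        return None                  # miss: compute PC + offset the slow way

    def record(self, pc, target):
        # Remember the target this branch used, for next time.
        self.table[pc % self.entries] = (pc, target)

btb = BranchTargetBuffer()
btb.record(0x1000, 0x2040)           # branch at 0x1000 went to 0x2040
print(hex(btb.lookup(0x1000)))       # → 0x2040
```

The tag check matters: two branches whose PCs share low-order bits map to the same entry, and without the tag one branch would fetch from the other's target.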
11- Decoding, Renaming and Dispatch
- This phase includes the detection and
resolution of hazards. The main job is to set up
one or more execution tuples for each instruction.
- (DEFINITION) A tuple is an ordered list
containing the operation, the storage elements
for the inputs and the location of the output.
- Often, to increase ILP, there are more physical
storage elements than logical ones, so multiple
values with the same logical address can be held
in different physical locations. When an
instruction creates a new value for a logical
address, the physical location holding it is
given a name known by the H/W.
- (DEFINITION) Renaming is defined as replacing
the logical register with the new physical name.
- There are 2 renaming methods
- The first uses a physical register file larger
than the logical one. A mapping table is used to
associate them. Renaming is performed in
sequential program order.
- The second method uses a physical register file
of the same size as the logical one, plus a
buffer with one entry per active instruction.
This buffer is called the reorder buffer.
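The first renaming method can be sketched as follows: a mapping table plus a free list of physical registers. The register-file sizes and the instruction encoding are assumptions for the sketch, and it omits freeing physical registers when instructions retire.

```python
class Renamer:
    def __init__(self, num_logical=32, num_physical=64):
        # Initially, logical register i lives in physical register i.
        self.map = list(range(num_logical))
        self.free = list(range(num_logical, num_physical))

    def rename(self, dst, srcs):
        """Rename one instruction, in sequential program order."""
        phys_srcs = [self.map[s] for s in srcs]  # read current mappings
        new_phys = self.free.pop(0)              # allocate a fresh name
        self.map[dst] = new_phys                 # later readers see it
        return new_phys, phys_srcs

r = Renamer()
# r3 = r1 + r2 ; r3 = r3 + r4  (the WAW on r3 disappears after renaming)
print(r.rename(3, [1, 2]))   # → (32, [1, 2])
print(r.rename(3, [3, 4]))   # → (33, [32, 4])
```

Because each write to r3 gets its own physical register (32, then 33), the two instructions no longer collide on a name and only the true RAW dependence remains.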
12(No Transcript)
13- Instruction Issuing and Parallel Execution
- There are 3 ways of organizing the instruction
issue buffers
- Single Queue Method: Register renaming is not
required. Operand availability can be managed via
simple reservation bits assigned to each
register. An instruction may issue if there are
no reservations on its operands.
- Multiple Queue Method: Instructions issue from
each queue in order. The individual queues are
organized according to instruction types (e.g. fp
queues, int queues, load/store queues).
- Reservation Stations: Instructions may issue
out of order. All the stations monitor their
source operands for data availability at the same
time. One way of doing this is to hold the
operand data in the reservation station itself.
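The reservation-station method can be sketched as below. This is a simplified illustration, assuming operands are keyed by register tags; it leaves out allocating and freeing stations and arbitration for functional units.

```python
class Station:
    def __init__(self, op, operands):
        self.op = op
        self.operands = operands     # tag -> value, or None while waiting

    def ready(self):
        return all(v is not None for v in self.operands.values())

def issue_ready(stations):
    """Issue every station whose operands are all available."""
    return [s.op for s in stations if s.ready()]

def broadcast(stations, tag, value):
    """A completing unit broadcasts its result to all stations."""
    for s in stations:
        if tag in s.operands and s.operands[tag] is None:
            s.operands[tag] = value

stations = [Station("add", {"r1": 5, "r2": None}),   # still waiting on r2
            Station("mul", {"r3": 2, "r4": 7})]      # both operands here
print(issue_ready(stations))         # → ['mul']  (out of program order)
broadcast(stations, "r2", 9)
print(issue_ready(stations))         # → ['add', 'mul']
```

The later `mul` issues before the earlier `add`; this is exactly the out-of-order issue the slide describes, driven purely by operand availability.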
14Organizing instr. issue queues
15Handling Memory Operations To reduce latency,
we use memory hierarchies. Most PCs today use
caches (L1, L2); the first one is smaller but
faster, and on chip. Unlike ALU operations,
load/store instructions need an address
calculation, usually an integer addition. After
that, we need an address translation to generate
a physical address; a Translation Lookaside
Buffer is used to speed up this step. Some
superscalar processors allow a single memory
operation per cycle. The trend is to allow
multiple memory requests at the same time, with a
multiported memory hierarchy. Most commonly, only
the L1 cache is multiported, because most
requests do not proceed to the lower levels of
memory. Once the operation has been submitted to
the memory hierarchy, it may hit or miss in the
data cache. On a miss, the accessed location must
be fetched into the cache. Miss Handling Status
Registers are used to track the status of
outstanding misses and allow multiple requests to
be overlapped.
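The miss-tracking idea can be sketched as follows. This is a hedged toy model, assuming a 64-byte line size and 8 entries (both invented for the sketch): a later miss to a line that is already being fetched merges into the existing entry instead of issuing a second request.

```python
LINE = 64  # assumed cache line size in bytes

class MSHRFile:
    def __init__(self, entries=8):
        self.entries = entries
        self.pending = {}            # line address -> waiting dest registers

    def access(self, addr, dest_reg):
        line = addr // LINE
        if line in self.pending:     # this line is already being fetched
            self.pending[line].append(dest_reg)
            return "merged"
        if len(self.pending) == self.entries:
            return "stall"           # no free MSHR: the load must wait
        self.pending[line] = [dest_reg]
        return "new miss"

    def fill(self, addr):
        # The line arrived; return every register waiting on it.
        return self.pending.pop(addr // LINE)

m = MSHRFile()
print(m.access(0x100, "r1"))   # → new miss
print(m.access(0x120, "r2"))   # same 64-byte line → merged
print(m.fill(0x100))           # → ['r1', 'r2']
```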
16- The Committing State
- It is the final phase of an instruction. Its
purpose is to implement the appearance of a
sequential execution model, even though the
reality is different. The actions necessary in
this phase depend on the technique used to
recover a precise state.
- 1st technique: The state is saved (or
checkpointed) at certain points in a history
buffer. Instructions update the state as they
execute, and when a precise state is needed, it
is recovered from the history buffer.
- 2nd technique: Separation of the state into 2
parts, the implemented physical state and a
logical state. The physical state is updated
immediately as operations complete. The logical
state is updated in sequential program order, as
the speculative status of instructions is
cleared. The speculative state is maintained in a
reorder buffer.
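The reorder-buffer technique can be sketched as below: results may arrive out of order, but the logical state is only updated from the head of the buffer, in program order. The entry fields are a representation invented for this sketch.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.buffer = deque()        # entries held in program order

    def dispatch(self, dst):
        entry = {"dst": dst, "value": None, "done": False}
        self.buffer.append(entry)
        return entry

    def commit(self, logical_state):
        """Retire finished head entries; stop at the first unfinished one."""
        while self.buffer and self.buffer[0]["done"]:
            e = self.buffer.popleft()
            logical_state[e["dst"]] = e["value"]

rob = ReorderBuffer()
state = {}
a = rob.dispatch("r1")
b = rob.dispatch("r2")
b["value"], b["done"] = 7, True      # r2 finishes first (out of order)
rob.commit(state)
print(state)                         # → {}  (unfinished r1 blocks commit)
a["value"], a["done"] = 3, True
rob.commit(state)
print(state)                         # → {'r1': 3, 'r2': 7}
```

Because nothing past an unfinished head entry can commit, discarding the buffer on a mispredicted branch or an interrupt leaves the logical state exactly as sequential execution would have: a precise state.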
17- MIPS R10000
- Fetches 4 instructions per cycle. These are
predecoded when they enter the cache.
- Branch prediction with a prediction table (512
lines, each holding a 2-bit counter to encode
history). If a branch is taken, it takes 1 clock
cycle to redirect the IF. During this cycle,
sequential instructions are fetched and placed in
a resume cache (4 blocks). When a branch is
predicted, the processor takes a snapshot of the
register mapping table. If the branch is
mispredicted, the register mapping can be quickly
recovered.
- The 4 instructions are dispatched into one of
three instruction queues: memory, integer, fp.
- Address adder, 2 int ALUs (one shifts and the
other multiplies/adds), an fp
multiplier/divider/square-rooter, an fp adder.
- On-chip primary cache, L2 cache.
- Reorder buffer mechanism for maintaining a
precise state.
18- ALPHA 21164
- IF from an 8KB instruction cache, 4
instructions per cycle.
- Instructions are issued in program order. That
restricts the instruction issue rate but
simplifies the control logic.
- Branch prediction with a prediction table that
records history using 2-bit counters. This table
is held in the instruction cache.
- 2 int ALUs, an fp adder, an fp multiplier.
- 2 levels of cache on chip. The primary cache
can sustain a number of outstanding misses
through a six-entry miss address file (MAF) that
contains the address and target register for each
load that misses.
- To provide a sequential state, this processor
does not issue out of order and keeps the
instructions in sequence as they flow down the
pipeline.
19DEC ALPHA 21164 Organization
20- AMD K5
- Uses variable-length instructions, sequentially
predecoded with 5 predecode bits.
- Branch prediction with one prediction entry per
cache line. It uses a single-bit counter to
encode history.
- 2 cycles are consumed for decoding. It uses
RISC-like OPerations known as ROPs.
- 2 int ALUs (one shifts, the other divides), an
fp unit, 2 load/store units, a branch unit. The
reservation stations are distributed among these
functional units.
- 8KB cache.
- 16-entry reorder buffer to maintain a precise
state.
21AMD K5 Organization