Transcript and Presenter's Notes

Title: National


1
National & Kapodistrian University of Athens
Dept. of Informatics & Telecommunications
MSc in Computer Systems Technology
Advanced Computer Architecture
  • The Microarchitecture of Superscalar Processors,
  • by J. E. Smith and G. S. Sohi
  • Giorgos Matrozos
  • M 414
  • matrozos@ceid.upatras.gr

2
  • An Introduction
  • Superscalar processing is the capability of
    initiating multiple instructions in the same
    clock cycle.
  • In the instruction fetch (IF) phase, the outcomes of conditional branches are determined early, by prediction.
  • Next, data dependences are resolved, and the instructions are distributed to the functional units.
  • Execution begins in parallel, based on the availability of operands; the sequence of the original program is usually not followed. (DEFINITION) This is therefore called dynamic instruction scheduling.
  • After this phase ends, the instructions are placed back into the original sequential order for commitment.

3
The microarchitecture of superscalar MPs
4
The Instruction Processing Model
A dominant consideration in designing a computer architecture is compatibility. In superscalar processors, this compatibility is called binary compatibility: the ability to execute programs written for older versions or generations. At some point it became obvious that instruction sets should be designed to be compatible. Until now, the sequential execution model has been followed; that is, instructions were executed in the order in which they entered the processor. There is also the need to define a precise state: when an interrupt occurs, the processor must save the state of the memory and the registers as of that point in time.
5
  • Elements Of High Performance
  • To achieve higher performance, we need to reduce execution time. The key to superscalar processors is executing multiple instructions in parallel.
  • (DEFINITION) The time to fetch and execute an instruction is called latency.
  • Superscalar processing involves:
  • Fetch strategies for simultaneously fetching multiple instructions, and branch prediction techniques.
  • Methods for determining all kinds of dependences.
  • Methods for issuing multiple instructions in parallel.
  • Resources for parallel execution (multiple pipelined functional units, memory hierarchies).
  • Methods for handling data through the memory hierarchy.
  • Methods for committing the process state in the correct order.
  • A good superscalar processor addresses all of the above as an integrated whole, not separately.

6
  • Problem Solved by Superscalar MPs (I)
  • The sequence of executed instructions forms a dynamic instruction stream. The first step to increase ILP is to overcome control dependences. (DEFINITION) An instruction is said to be control dependent on a preceding instruction if control flow must pass through the preceding instruction first.
  • 1st type → due to an incremented PC (sequential flow)
  • 2nd type → due to an explicitly updated PC (branches, jumps)
  • Solution for the 1st type: The static program is divided into blocks. Once a block has been entered into the instruction fetch register, it is known that all of its instructions will eventually be executed, so any sequence of instructions in the block can be initiated into a conceptual window of execution (WoE), within which instructions are free to execute in parallel.
  • Solution for the 2nd type: Predict the outcome and speculatively fetch and execute instructions from the predicted path. Instructions from the predicted path enter the WoE. If the prediction is correct, the speculative status is removed and the effect on the state is the same as for any other instruction. If the prediction is incorrect, the speculative execution was incorrect and recovery must be initiated.

7
  • Problem Solved by Superscalar MPs (II)
  • Next come data dependences. They occur among instructions because instructions may read or write the same storage location.
  • (DEFINITION) When this happens, a hazard is said to exist. We have 3 types of hazards:
  • RAW (read after write), WAR (write after read), WAW (write after write).
  • After control and data dependences are resolved, instructions are issued for execution. In essence, the hardware creates a parallel execution schedule in which the order of instructions differs from that of the sequential program. Moreover, speculative execution means that some instructions complete execution even though they would not have been executed at all if the sequential model had been followed.
  • Let us see that in the next picture.

8
(No Transcript)
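The hazard classification from the previous slide can be sketched by comparing the register sets two in-order instructions read and write (a minimal illustration; the function and register names are made up for the example):

```python
def hazards(earlier_writes, earlier_reads, later_writes, later_reads):
    """Return the set of hazard types between an earlier and a later instruction."""
    found = set()
    if later_reads & earlier_writes:   # read after write: true dependence
        found.add("RAW")
    if later_writes & earlier_reads:   # write after read: anti-dependence
        found.add("WAR")
    if later_writes & earlier_writes:  # write after write: output dependence
        found.add("WAW")
    return found

# r1 = r2 + r3  followed by  r4 = r1 + r1  -> RAW hazard on r1
print(hazards({"r1"}, {"r2", "r3"}, {"r4"}, {"r1"}))  # → {'RAW'}
```

Only RAW hazards are true dependences; as the renaming slide later shows, WAR and WAW can be removed by giving each new value a fresh physical register.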
9
  • Instruction Fetching and Branch Prediction (I)
  • Superscalar MPs have an instruction cache: a memory containing recently used instructions, used to reduce fetch latency. It is organised into blocks, or lines.
  • The default method for instruction fetch is to increment the PC by the number of instructions fetched and use the incremented PC to fetch the next block.
  • Processing of conditional branch instructions can be broken down into:
  • Recognizing conditional branches: Obvious!! Some extra decode bits are held in the instruction cache to identify all types of instructions.
  • Determining the outcome: Some predictors use static information, e.g. that certain opcode types result more often in taken branches, or execution statistics. Other predictors use dynamic information, such as the past history of branch outcomes; a history (or prediction) table is used. Usually two bits are used per entry, forming a saturating counter that is incremented when the branch is taken and decremented when it is not.
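The two-bit saturating-counter scheme described above can be sketched as follows (a minimal model; the table size and simple PC-modulo indexing are illustrative assumptions, not details from the slides):

```python
class TwoBitPredictor:
    """History table of 2-bit saturating counters, one per entry."""

    def __init__(self, entries=512):
        self.entries = entries
        self.table = [1] * entries       # start in "weakly not-taken" (states 0-3)

    def _index(self, pc):
        return pc % self.entries         # illustrative PC-based indexing

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # states 2,3 predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate at 0
```

The two-bit counter's advantage over a single bit is hysteresis: one anomalous outcome (e.g. a loop exit) moves a strongly-taken entry only to weakly-taken, so the next prediction is still taken.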

10
  • Instruction Fetching and Branch Prediction (II)
  • Computing branch targets: Usually an integer addition is required. In most computers, targets are specified relative to the PC using an offset. To speed up the process, a branch target buffer holds the target address that was used the last time the branch was executed.
  • Transferring control: When a branch is predicted taken, there is at least one cycle of delay in recognizing the branch, modifying the PC, and fetching instructions from the target. This may result in pipeline bubbles. One solution is to use the instruction buffer to mask the delay. Some earlier RISC instruction sets used delayed branches: a branch did not take effect until the instruction after the branch.
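The branch target buffer described above can be sketched as a small lookup table from a branch's PC to the target it jumped to last time (a simplified model; a real BTB is a fixed-size, tagged hardware structure, and the method names here are illustrative):

```python
class BranchTargetBuffer:
    """Maps a branch instruction's PC to its last-used target address."""

    def __init__(self):
        self.targets = {}

    def lookup(self, pc):
        # Hit: fetch can be redirected to the cached target immediately,
        # without waiting for the target-address addition.
        return self.targets.get(pc)     # None on a BTB miss

    def record(self, pc, target):
        # Updated when the branch resolves (or is taken), so the next
        # fetch of this PC finds the target ready.
        self.targets[pc] = target
```

On a hit, the fetch unit can follow the predicted-taken path in the very next cycle, which is what masks the one-cycle (or longer) redirection delay the slide mentions.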

11
  • Decoding, Renaming and Dispatch
  • This phase includes the detection and resolution of hazards. The main job is to set up one or more execution tuples for each instruction.
  • (DEFINITION) A tuple is an ordered list containing the operation, the storage elements for its inputs, and the location of its output.
  • Often, to increase ILP, there are more physical storage elements than logical ones, making it possible to store multiple values with the same logical address at once. When an instruction creates a new value for a logical address, the physical element holding it is given a name known to the hardware.
  • (DEFINITION) Renaming is defined as replacing the logical register with the new physical name.
  • There are 2 renaming methods:
  • In the first, there is a physical register file larger than the logical one, and a mapping table is used to associate them. Renaming is performed in sequential program order.
  • The second method uses a physical register file of the same size as the logical one, plus a buffer with one entry per active instruction. This buffer is called the reorder buffer.
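The first renaming method can be sketched with a map table and a free list of physical registers (a minimal model; the register counts and r/p naming are illustrative):

```python
class Renamer:
    """Map-table renaming: logical registers r0..r31, physical p0..p63."""

    def __init__(self, num_logical=32, num_physical=64):
        # Initially each logical register maps to a physical one directly.
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.free = [f"p{i}" for i in range(num_logical, num_physical)]

    def rename(self, dest, sources):
        """Rename one instruction, in program order."""
        srcs = [self.map[s] for s in sources]   # read current mappings first
        phys = self.free.pop(0)                 # fresh physical reg for the result
        self.map[dest] = phys                   # dest now names the new value
        return phys, srcs
```

Because every write gets a fresh physical register, two writes to the same logical register no longer conflict: WAW and WAR hazards disappear, and later readers of r1 are pointed at whichever physical register holds the value they depend on.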

12
(No Transcript)
13
  • Instruction Issuing and Parallel Execution
  • There are 3 ways of organizing the instruction issue buffers:
  • Single Queue Method: Register renaming is not required. Operand availability can be managed via simple reservation bits assigned to each register. An instruction may issue if there are no reservations on its operands.
  • Multiple Queue Method: Instructions issue from each queue in order. The individual queues are organized according to instruction type (e.g. fp queues, int queues, load/store queues).
  • Reservation Stations: Instructions may issue out of order. All stations monitor their source operands for data availability at the same time; the usual way of doing this is to hold the operand data in the reservation station itself.
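The reservation-station mechanism above can be sketched as entries that hold either an operand's value or the tag of the in-flight instruction that will produce it, capturing values as results are broadcast (a simplified model; the tag names and interface are illustrative):

```python
class ReservationStation:
    """Entries wait for operands, then issue out of order when ready."""

    def __init__(self):
        self.entries = []

    def dispatch(self, op, operands):
        # operands: list of (tag, value) pairs; value is None while in flight
        self.entries.append({"op": op, "src": [list(o) for o in operands]})

    def broadcast(self, tag, value):
        # Result bus: every entry waiting on this tag captures the value.
        for e in self.entries:
            for s in e["src"]:
                if s[0] == tag and s[1] is None:
                    s[1] = value

    def issue(self):
        # Out-of-order issue: pick any entry whose operands are all ready.
        for e in self.entries:
            if all(s[1] is not None for s in e["src"]):
                self.entries.remove(e)
                return e
        return None
```

Holding the operand data in the station itself (rather than just a register number) is what lets a waiting instruction issue the moment the last value arrives, regardless of program order.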

14
Organizing instruction issue queues
15
Handling Memory Operations
To reduce latency, we use memory hierarchies. Most PCs today use caches (L1, L2); the first is smaller but faster, and on chip. Unlike ALU operations, load/store instructions need an address calculation, usually an integer addition, followed by an address translation to generate a physical address. A Translation Lookaside Buffer (TLB) is used to speed up this translation. Some superscalar processors allow a single memory operation per cycle; the trend is to allow multiple memory requests at the same time, using a multiported memory hierarchy. Most commonly, only the L1 cache is multiported, because most requests do not proceed to lower levels of the memory hierarchy. Once an operation has been submitted to the memory hierarchy, it may hit or miss in the data cache. On a miss, the accessed location must be fetched into the cache. Miss Handling Status Registers are used to track the status of outstanding misses and to allow multiple requests to be overlapped.
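The miss-handling status registers mentioned above can be sketched as a small file of outstanding misses that merges requests to the same cache line and lets independent misses overlap (a simplified model; the entry count and interface are illustrative):

```python
class MSHRFile:
    """Tracks outstanding cache misses; merges requests to the same line."""

    def __init__(self, num_entries=6):
        self.num_entries = num_entries
        self.misses = {}   # line address -> list of waiting requests

    def on_miss(self, line, request):
        """Returns True if the request was accepted, False on a stall."""
        if line in self.misses:
            self.misses[line].append(request)   # merge with the in-flight miss
            return True
        if len(self.misses) >= self.num_entries:
            return False                        # all MSHRs busy: structural stall
        self.misses[line] = [request]           # new outstanding miss
        return True

    def on_fill(self, line):
        """Line arrived from memory: wake every request merged onto it."""
        return self.misses.pop(line, [])
```

Without such a structure the cache would block on the first miss; with it, later loads and stores keep flowing while earlier misses are still being serviced.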
16
  • The Committing State
  • This is the final phase of an instruction. Its purpose is to implement the appearance of a sequential execution model, even though the reality is different. The actions necessary in this phase depend on the technique used to recover a precise state.
  • 1st technique: The state is saved (checkpointed) at certain points in a history buffer. Instructions update the state as they execute, and when a precise state is needed, it is recovered from the history buffer.
  • 2nd technique: The state is separated into 2 parts, the implemented physical state and a logical state. The physical state is updated immediately as operations complete; the logical state is updated in sequential program order, as the speculative status of instructions is cleared. The speculative state is maintained in a reorder buffer.
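The second technique can be sketched with a reorder buffer that holds speculative results and retires them to the logical state strictly in program order (a minimal model; the entry fields and method names are illustrative):

```python
from collections import deque

class ReorderBuffer:
    """Speculative results retire to the logical state in program order."""

    def __init__(self):
        self.rob = deque()          # entries in program (allocation) order
        self.logical_state = {}     # committed, precise register values

    def allocate(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.rob.append(entry)      # allocated in program order
        return entry

    def complete(self, entry, value):
        entry["value"] = value      # physical state: updated immediately,
        entry["done"] = True        # possibly out of order

    def commit(self):
        # Retire only from the head, so the logical state always reflects
        # a prefix of the sequential program.
        while self.rob and self.rob[0]["done"]:
            e = self.rob.popleft()
            self.logical_state[e["dest"]] = e["value"]

    def squash(self):
        self.rob.clear()            # misprediction: discard speculative work
```

Because retirement stops at the first unfinished entry, an interrupt or misprediction at any point leaves `logical_state` exactly as the sequential model would: the precise state.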

17
  • MIPS R10000
  • Fetches 4 instructions at a time; these are predecoded as they enter the cache.
  • Branch prediction uses a prediction table (512 lines, with a 2-bit counter to encode history). If a branch is taken, it takes 1 clock cycle to redirect instruction fetch; during this cycle, sequential instructions are fetched and placed in a resume cache (4 blocks). When a branch is predicted, the processor takes a snapshot of the register mapping table, so if the branch is mispredicted, the register mapping can be quickly recovered.
  • 4 instructions are dispatched into one of three instruction queues: memory, integer, fp.
  • Units: an address adder, 2 int ALUs (one shifts and the other multiplies/adds), an fp multiplier/divider/square-rooter, and an fp adder.
  • On-chip primary cache, plus an L2 cache.
  • A reorder buffer mechanism maintains a precise state.

18
  • ALPHA 21164
  • Instruction fetch from an 8KB instruction cache, 4 instructions at a time.
  • Instructions are issued in program order. This restricts the instruction issue rate but simplifies the control logic.
  • Branch prediction uses a prediction table that records history with 2-bit counters; this table is held in the instruction cache.
  • 2 int ALUs, an fp adder, an fp multiplier.
  • 2 levels of cache on chip. The primary cache can sustain a number of outstanding misses through six-entry miss address files (MAFs), which contain the address and target register for each load that misses.
  • To provide a sequential state, this processor does not issue out of order and keeps the instructions in sequence as they flow down the pipeline.

19
DEC ALPHA 21164 Organization
20
  • AMD K5
  • Uses variable-length instructions, sequentially predecoded with 5 predecode bits.
  • Branch prediction with one prediction entry per cache line; a single-bit counter encodes history.
  • 2 cycles are consumed for decoding. Instructions are translated into RISC-like OPerations known as ROPs.
  • 2 int ALUs (one shifts, the other divides), an fp unit, 2 load/store units, and a branch unit. The reservation stations are distributed among these functional units.
  • 8KB cache.
  • A 16-entry reorder buffer maintains a precise state.

21
AMD K5 Organization