1
Superscalar Processors
  • by Sherri Sparks

2
Overview
  1. What are superscalar processors?
  2. Program Representation, Dependencies, & Parallel
    Execution
  3. Microarchitecture of a typical superscalar
    processor
  4. A look at 3 superscalar implementations
  5. Conclusion: The future of superscalar processing

3
What are superscalars and how do they differ
from pipelines?
  • In simple pipelining, you are limited to fetching
    a single instruction into the pipeline per clock
    cycle. This causes a performance bottleneck.
  • Superscalar processors overcome the 1-instruction-
    per-clock-cycle limit of simple pipelines: they
    can fetch multiple instructions during the same
    clock cycle. They also employ advanced techniques
    like branch prediction to ensure an uninterrupted
    stream of instructions.

4
Development History of Superscalars
  • Pipelining was developed in the late 1950s and
    became popular in the 1960s.
  • Examples of early pipelined architectures are the
    CDC 6600 and the IBM 360/91 (Tomasulo's
    algorithm).
  • Superscalars appeared in the mid to late 1980s.

5
Instruction Processing Model
  • Need to maintain software compatibility.
  • Compatibility was maintained at the instruction
    set level, so that existing software is not
    affected.
  • Need to maintain at least a semblance of a
    sequential execution model for programmers, who
    rely on the concept of sequential execution in
    software design.
  • A superscalar processor may execute instructions
    out of order at the hardware level, but execution
    must appear sequential at the programming level.

6
Superscalar Implementation
  • Instruction fetch strategies that simultaneously
    fetch multiple instructions, often by using branch
    prediction techniques.
  • Methods for determining data dependencies and
    keeping track of register values during execution.
  • Methods for issuing multiple instructions in
    parallel.
  • Resources for parallel execution of many
    instructions, including multiple pipelined
    functional units and memory hierarchies capable
    of simultaneously servicing multiple memory
    references.
  • Methods for communicating data values through
    memory via load and store instructions.
  • Methods for committing the process state in
    correct order, to maintain the outward appearance
    of sequential execution.

7
From Sequential to Parallel
  • Parallel execution often results in instructions
    completing out of sequential order.
  • Speculative execution means that some
    instructions may be executed when they would not
    have been executed at all according to the
    sequential model (i.e., after an incorrect branch
    prediction).
  • To maintain the outward appearance of sequential
    execution for the programmer, storage cannot be
    updated immediately. The results must be held in
    temporary status until the storage is updated.
    Meanwhile, these temporary results must be usable
    by dependent instructions.
  • When it is determined that the sequential model
    would have executed an instruction, the temporary
    results are made permanent by updating the
    outward state of the machine. This process is
    called committing the instruction.

8
Dependencies
  • Parallel execution must contend with 2 types of
    dependencies
  • Control dependencies, due to incrementing or
    updating the program counter in response to
    conditional branch instructions.
  • Data dependencies, due to contention for storage,
    as instructions may need to read / write the same
    storage or memory locations.

9
Overcoming Control Dependencies: Example
  • L2: move r3,r7
  • lw r8,(r3)
  • add r3,r3,4
  • lw r9,(r3)
  • ble r8,r9,L3
  • move r3,r7
  • sw r9,(r3)
  • add r3,r3,4
  • sw r8,(r3)
  • add r5,r5,1
  • L3: add r6,r6,1
  • add r7,r7,4
  • blt r6,r4,L2
  • Blocks of instructions are initiated into the
    window of execution. (The original figure divides
    this code into three basic blocks, delimited by
    the conditional branches.)

10
Control Dependencies: Branch Prediction
  • To gain the most parallelism, control
    dependencies due to conditional branches have to
    be overcome.
  • Branch prediction attempts to overcome this by
    predicting the outcome of a branch and
    speculatively fetching and executing instructions
    from the predicted path.
  • If the predicted path is correct, the speculative
    status of the instructions is removed and they
    affect the state of the machine like any other
    instruction.
  • If the predicted path is wrong, then recovery
    actions are taken so as not to incorrectly modify
    the state of the machine.

11
Data Dependencies
  • Data dependencies occur because instructions may
    access the same register or memory location.
  • 3 types of data dependencies, or hazards (a
    minimal detection sketch follows this list)
  • RAW (read after write) occurs because a later
    instruction can only read a value after a
    previous instruction has written it.
  • WAR (write after read) occurs when an
    instruction needs to write a new value into a
    storage location but must wait until all
    preceding instructions needing to read the old
    value have done so.
  • WAW (write after write) occurs when multiple
    instructions update the same storage location; it
    must appear that these updates occur in the
    proper sequence.
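
A minimal Python sketch (not from the original slides) of
how these three hazard classes can be detected, by
intersecting the read and write register sets of two
instructions; the dictionary encoding of an instruction is
an illustrative assumption.

    # first precedes second in program order; each is a dict
    # with 'reads' and 'writes' register-name sets.
    def hazards(first, second):
        found = []
        if first["writes"] & second["reads"]:
            found.append("RAW")  # second reads what first writes
        if first["reads"] & second["writes"]:
            found.append("WAR")  # second overwrites what first reads
        if first["writes"] & second["writes"]:
            found.append("WAW")  # both write; order must be preserved
        return found

    # e.g., lw r8,(r3) followed by add r3,r3,4
    i1 = {"reads": {"r3"}, "writes": {"r8"}}
    i2 = {"reads": {"r3"}, "writes": {"r3"}}
    print(hazards(i1, i2))  # ['WAR']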

12
Data Dependency Example
  • move r3,r7
  • lw r8,(r3)
  • add r3,r3,4
  • lw r9,(r3)
  • ble r8,r9,L3

  (In the original figure, arrows mark the hazards in
  this code: a RAW dependency on r3 from add r3,r3,4 to
  lw r9,(r3); a WAW dependency on r3 between move r3,r7
  and add r3,r3,4; and a WAR dependency on r3 from lw
  r8,(r3) to add r3,r3,4.)
13
Parallel Execution Method
  1. Instructions are fetched using branch prediction
    to form a dynamic stream of instructions.
  2. Instructions are examined for dependencies, and
    artificial (WAR and WAW) dependencies are
    removed.
  3. Examined instructions are dispatched to the
    window of execution. (These instructions are no
    longer in sequential order, but are ordered
    according to their data dependencies.)
  4. Instructions are issued from the window in an
    order determined by their dependencies and
    hardware resource availability.
  5. Following execution, instructions are put back
    into their sequential program order and then
    committed, so their results update the machine
    state.

14
Superscalar Microarchitecture
  • Parallel Execution Method summarized in 5 phases:
  • 1. Instruction Fetch & Branch Prediction
  • 2. Decode & Register Dependence Analysis
  • 3. Issue & Execution
  • 4. Memory Operation Analysis & Execution
  • 5. Instruction Reorder & Commit

15
Superscalar Microarchitecture
16
Instruction Fetch & Branch Prediction
  • The fetch phase must fetch multiple instructions
    per cycle from cache memory to keep a steady feed
    of instructions going to the other stages.
  • The number of instructions fetched per cycle
    should match or be greater than the peak
    instruction decode and execution rate (to allow
    for cache misses or occasions when the maximum
    number of instructions can't be fetched).
  • For conditional branches, the fetch mechanism
    must be redirected to fetch instructions from
    branch targets.
  • 4 steps to processing conditional branch
    instructions
  • 1. Recognizing that an instruction is a
    conditional branch
  • 2. Determining the branch outcome (taken or not
    taken)
  • 3. Computing the branch target
  • 4. Transferring control by redirecting
    instruction fetch (in the case of a taken
    branch)

17
Processing Conditional Branches
  • STEP 1: Recognizing Conditional Branches
  • Instruction decode information is held in the
    instruction cache as extra bits; these bits are
    used to identify the basic instruction types.

18
Processing Conditional Branches
  • STEP 2: Determining the Branch Outcome
  • Static predictions (information determined from
    the static binary). E.g., certain opcode types
    might result in more branches taken than others,
    or a backwards branch direction might be more
    likely in loops.
  • Predictions based on profiling information
    (execution statistics collected during a previous
    run of the program).
  • Dynamic predictions (information gathered during
    program execution about the past history of
    branch outcomes). Branch outcomes are stored in
    a branch history table or a branch prediction
    table (sketched below).
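
A minimal Python sketch (not from the slides) of dynamic
prediction using a branch history table of 2-bit saturating
counters; the 512-entry size echoes the R10000's prediction
table mentioned later, and the PC indexing scheme is an
assumption chosen for the example.

    TABLE_SIZE = 512           # e.g., the R10000 uses a 512-entry table
    table = [1] * TABLE_SIZE   # counters: 0..1 predict not taken, 2..3 taken

    def predict(pc):
        return table[(pc >> 2) % TABLE_SIZE] >= 2  # True = predict taken

    def update(pc, taken):
        i = (pc >> 2) % TABLE_SIZE
        if taken:
            table[i] = min(table[i] + 1, 3)  # saturate at strongly taken
        else:
            table[i] = max(table[i] - 1, 0)  # saturate at strongly not taken

Because a single misprediction only weakens the counter, a
loop branch that is almost always taken is not re-predicted
not-taken after one loop exit.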

19
Processing Conditional Branches
  • STEP 3: Computing Branch Targets
  • Branch targets are usually relative to the
    program counter and are computed as
  • branch target = program counter + offset
  • Finding target addresses can be sped up by a
    branch target buffer, which holds the target
    address used the last time the branch was
    executed (sketched below).
  • E.g., the Branch Target Address Cache used in the
    PowerPC 604.
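
A minimal Python sketch (not from the slides) of the branch
target buffer idea; a plain dictionary stands in for the
fixed-size hardware buffer.

    btb = {}  # assumption: unbounded dict models a small hardware cache

    def lookup(pc):
        """Return the last-seen target for this branch, or None."""
        return btb.get(pc)

    def record(pc, target):
        """After a taken branch executes, remember its target."""
        btb[pc] = target

    # Without the buffer, the target must wait for the
    # pc + offset computation after decode; with a hit, fetch
    # can be redirected immediately.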

20
Processing Conditional Branches
  • STEP 4: Transferring Control
  • Problem: there is often a delay in recognizing a
    branch, modifying the program counter, and
    fetching the target instructions.
  • Several solutions
  • Use the stockpiled instructions in the
    instruction buffer to mask the delay.
  • Use a buffer that contains instructions from both
    the taken and not-taken branch paths.
  • Delayed branches: the branch does not take effect
    until the instruction after the branch. This
    allows the fetch of target instructions to
    overlap execution of the instruction following
    the branch. However, delayed branches introduce
    assumptions about pipeline structure and are
    therefore rarely used anymore.

21
Instruction Decoding, Renaming, Dispatch
  • Instructions are removed from the fetch buffers,
    decoded and examined for control and data
    dependencies.
  • Instructions are dispatched to buffers associated
    with hardware functional units for later issuing
    and execution.

22
Instruction Decoding
  • The decode phase sets up execution tuples for
    each instruction.
  • An execution tuple contains
  • The operation to be executed
  • The identities of the storage elements where the
    input operands will eventually reside
  • The location where the instruction's result must
    be placed (a minimal sketch follows)
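
A minimal Python sketch (not from the slides) of an
execution tuple as a record; the field names and the
physical register names (p12, p19) are illustrative
assumptions.

    from dataclasses import dataclass

    @dataclass
    class ExecutionTuple:
        op: str           # the operation to be executed
        sources: tuple    # storage elements holding the input operands
        destination: str  # location where the result must be placed

    # e.g., add r3,r3,4 after renaming (hypothetical mapping):
    t = ExecutionTuple(op="add", sources=("p12", "imm:4"),
                       destination="p19")
    print(t)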

23
Register Renaming
  • Used to eliminate WAW and WAR dependencies.
  • 2 types (the first is sketched below)
  • Physical register file: the physical register
    file is larger than the logical register file,
    and a mapping table is used to associate physical
    registers with logical registers. Physical
    registers are assigned from a free list.
  • Reorder buffer: uses same-sized physical and
    logical register files, plus a reorder buffer
    that contains 1 entry per active instruction and
    maintains the sequential ordering of
    instructions. It is a circular queue implemented
    in hardware. As instructions are dispatched, they
    enter the queue at the tail. As instructions
    complete, their results are inserted into their
    assigned locations in the reorder buffer. When an
    instruction reaches the head of the queue, its
    entry is removed and its result is placed in the
    register file.
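
A minimal Python sketch (not from the slides) of the first
scheme: a mapping table from logical to physical registers
plus a free list; the register counts are illustrative
assumptions.

    free_list = [f"p{i}" for i in range(8, 16)]       # unused physical regs
    map_table = {f"r{i}": f"p{i}" for i in range(8)}  # logical -> physical

    def rename(op, sources, dest):
        """Give dest a fresh physical register, removing WAR and
        WAW hazards on that logical register."""
        new_sources = [map_table.get(s, s) for s in sources]
        new_dest = free_list.pop(0)  # assign from the free list
        map_table[dest] = new_dest   # later readers use the new register
        return op, new_sources, new_dest

    print(rename("add", ["r3", "r3"], "r3"))
    # ('add', ['p3', 'p3'], 'p8')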

24
Register Renaming I
25
Register Renaming II (using a reorder buffer)
26
Instruction Issuing & Parallel Execution
  • Instruction issuing is defined as the run-time
    checking for availability of data and resources.
  • Constraints on instruction issue
  • Availability of physical resources like
    functional units, interconnect, and the register
    file
  • Organization of the buffers holding execution
    tuples

27
Single Queue Method
  • If there is no out-of-order issuing, operand
    availability can be managed via reservation bits
    assigned to each register.
  • A register is reserved when an instruction that
    modifies the register issues.
  • The reservation is cleared when the instruction
    completes.
  • An instruction may issue if there are no
    reservations on its operands (sketched below).
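
A minimal Python sketch (not from the slides) of issue
gating with per-register reservation bits; the tuple
encoding of an instruction is an illustrative assumption.

    reserved = set()  # registers with a pending write

    def try_issue(instr):
        """instr = (op, sources, dest); in-order issue only."""
        op, sources, dest = instr
        if any(s in reserved for s in sources):
            return False    # stall: an operand is still being produced
        reserved.add(dest)  # reserve dest while its write is pending
        return True

    def complete(instr):
        reserved.discard(instr[2])  # clear the reservation on completion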

28
Multiple Queue Method
  • There are multiple queues organized according to
    instruction type.
  • Instructions issue from individual queues in
    sequential order.
  • Individual queues may issue out of order with
    respect to one another.

29
Reservation Stations
  • Instructions issue out of order.
  • Reservation stations hold information about the
    source operands for an operation.
  • When all operands are present, the instruction
    may issue (sketched below).
  • Reservation stations may be partitioned according
    to instruction type or pooled into a single large
    block.
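
A minimal Python sketch (not from the slides) of a pooled
reservation station; the entry format and the result
broadcast are simplified assumptions.

    stations = []  # entries: {"op": ..., "operands": {name: value or None}}

    def dispatch(op, operands):
        stations.append({"op": op, "operands": dict(operands)})

    def broadcast(name, value):
        """A completing instruction forwards its result to waiters."""
        for entry in stations:
            ops = entry["operands"]
            if name in ops and ops[name] is None:
                ops[name] = value

    def issue_ready():
        """Issue, possibly out of order, every entry whose
        operands are all present."""
        ready = [e for e in stations
                 if all(v is not None for v in e["operands"].values())]
        for e in ready:
            stations.remove(e)
        return ready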

30
Memory Operation Analysis & Execution
  • To reduce latency, memory hierarchies are used
    that may contain primary and secondary caches.
  • Address translation to physical addresses is sped
    up by a translation lookaside buffer, which
    caches recently used address translations.
  • A multiported memory hierarchy is used to allow
    multiple memory requests to be serviced
    simultaneously. Multiporting is achieved by
    having multiple memory banks or by making
    multiple serial requests during the same cycle.
  • Store address buffers are used to make sure
    memory operations don't violate hazard conditions
    (sketched below). Store address buffers contain
    the addresses of all pending store operations.
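
A minimal Python sketch (not from the slides) of checking a
load against a store address buffer; a plain list of
addresses stands in for the hardware buffer.

    pending_stores = []  # addresses of all pending store operations

    def may_issue_load(addr):
        """A load may bypass earlier stores only if its address
        matches no pending store (a memory RAW hazard otherwise)."""
        return addr not in pending_stores

    def store_dispatched(addr):
        pending_stores.append(addr)

    def store_completed(addr):
        pending_stores.remove(addr)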

31
Memory Hazard Detection
32
Instruction Reorder & Commit
  • When an instruction is committed, its result is
    allowed to modify the logical state of the
    machine.
  • The purpose of the commit phase is to maintain
    the illusion of a sequential execution model.
  • 2 methods (the second is sketched below)
  • 1. The state of the machine is saved in a history
    buffer. Instructions update the state of the
    machine as they execute, and when there is a
    problem, the state of the machine can be
    recovered from the history buffer. The commit
    phase gets rid of history state that is no longer
    needed.
  • 2. The state of the machine is separated into a
    physical state and a logical state. The physical
    state is updated as instructions complete. The
    logical state is updated in sequential order as
    the speculative status of instructions is
    cleared. The speculative state is maintained in a
    reorder buffer, and during the commit phase the
    result of an operation is moved from the reorder
    buffer to a logical register or memory.
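
A minimal Python sketch (not from the slides) of the second
method: a reorder buffer whose head entry updates the
logical state at commit; the entry format is an
illustrative assumption.

    from collections import deque

    rob = deque()      # circular queue of active instructions, in order
    logical_regs = {}  # the sequential, architecturally visible state

    def dispatch(dest):
        entry = {"dest": dest, "result": None, "speculative": True}
        rob.append(entry)  # enter at the tail, in program order
        return entry

    def commit():
        """Retire from the head only, once the result exists and
        the speculative status has been cleared."""
        while rob and rob[0]["result"] is not None \
                and not rob[0]["speculative"]:
            e = rob.popleft()
            logical_regs[e["dest"]] = e["result"]  # update logical state

    e = dispatch("r3")
    e["result"], e["speculative"] = 42, False  # executed, branch resolved
    commit()
    print(logical_regs)  # {'r3': 42}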

33
The Role of Software
  • Superscalars can be made more efficient if
    parallelism in software can be increased.
  • 1. By increasing the likelihood that a group of
    instructions can be issued simultaneously
  • 2. By decreasing the likelihood that an
    instruction has to wait for the result of a
    previous instruction

34
A Look At 3 Superscalar Processors
  1. MIPS R10000
  2. DEC Alpha 21164
  3. AMD K5

35
MIPS R10000
  • A typical superscalar processor.
  • Able to fetch 4 instructions at a time.
  • Uses predecode to generate bits that assist with
    branch prediction (512-entry prediction table).
  • A resume cache is used to hold instructions from
    the not-taken path and has space to handle 4
    branch predictions at a time.
  • Register renaming uses a physical register file
    2x the size of the logical register file.
    Physical registers are allocated from a free
    list.
  • 3 instruction queues: memory, integer, and
    floating point.
  • 5 functional units (an address adder, 2 integer
    ALUs, a floating point multiplier / divider /
    square-rooter, and a floating point adder).
  • Has an on-chip primary data cache (32 KB, 2-way
    set associative) and an off-chip secondary cache.
  • Uses the reorder buffer mechanism to maintain
    machine state during exceptions.
  • Instructions are committed 4 at a time.

36
Alpha 21164
  • A simple superscalar that forgoes the advantage
    of dynamic scheduling in favor of a high clock
    rate.
  • 4 instructions at a time are fetched from an
    8 KB instruction cache.
  • 2 instruction buffers issue instructions in
    program order.
  • Branches are predicted using a history table
    associated with the instruction cache.
  • Uses the single queue method of instruction
    issuing.
  • 4 functional units (2 ALUs, a floating point
    adder, and a floating point multiplier).
  • 2-level cache memory (primary: 8 KB cache;
    secondary: 96 KB, 3-way set associative cache).
  • Sequential machine state is maintained during
    interrupts because instructions are not issued
    out of order.
  • The pipeline functions as a simple reorder
    buffer, since instructions in the pipeline are
    maintained in sequential order.

37
Alpha 21164 Superscalar Organization
38
AMD-K5
  • Implements the complex Intel x86 instruction set.
  • Uses 5 predecode bits for decoding variable-
    length instructions.
  • Instructions are fetched from the instruction
    cache at a rate of 16 bytes / cycle and placed in
    a 16-element queue.
  • Branch prediction is integrated with the
    instruction cache. There is 1 prediction entry
    per cache line.
  • Due to instruction set complexity, 2 cycles are
    required to decode.
  • Instructions are converted to ROPs (simple
    RISC-like operations).
  • Instructions read operand data and are dispatched
    to functional unit reservation stations.
  • There are 6 functional units: 2 integer ALUs, 1
    floating point unit, 2 load / store units, and a
    branch unit.
  • Up to 4 ROPs can be issued per clock cycle.
  • Has an 8 KB data cache with 4 banks. Dual loads /
    stores are allowed to different banks.
  • A 16-entry reorder buffer maintains machine state
    when there is an exception and recovers from
    incorrect branch predictions.

39
AMD K5 Superscalar Organization
40
The Future of Superscalar Processing
  • Superscalar design has produced real performance
    gains.
  • BUT increasing hardware parallelism may be a case
    of diminishing returns.
  • There are limits to the instruction-level
    parallelism in programs that can be exploited.
  • Simultaneously issuing more instructions
    increases complexity and requires more
    cross-checking. This will eventually affect the
    clock rate.
  • There is a widening gap between processor and
    memory performance.
  • Many believe that the 8-way superscalar is the
    limit and that we will reach this limit within 2
    years.
  • Some believe VLIW will replace superscalars and
    offers advantages
  • Because software is responsible for creating the
    execution schedule, the instruction window that
    can be examined for parallelism is larger than a
    superscalar processor can examine in hardware.
  • Since there is no dependence checking by the
    processor, VLIW hardware is simpler to implement
    and may allow a faster clock.

41
Reference
  • Smith, J. E., and Sohi, G. S., "The
    Microarchitecture of Superscalar Processors,"
    Proceedings of the IEEE, vol. 83, no. 12, Dec.
    1995.

42
  • Questions?