The anatomy of a modern superscalar processor - PowerPoint PPT Presentation

1
The anatomy of a modern superscalar processor
  • Constantinos Kourouyiannis
  • Madhava Rao Andagunda

2
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

3
Introduction
  • Superscalar processing is the ability to initiate
    multiple instructions during the same cycle.
  • It aims at producing ever faster microprocessors.
  • A typical superscalar processor fetches and
    decodes several instructions at a time.
    Instructions are executed in parallel based on
    the availability of operand data rather than
    their original program sequence. Upon completion,
    instructions are re-sequenced so that they can
    update the process state in the correct program
    order.

4
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

5
Microarchitecture
  • Instruction Fetch and Branch Prediction
  • Decode and Register Dependence Analysis
  • Issue and Execution
  • Memory Operation Analysis and Execution
  • Instruction Reorder and Commit

6
Organization of a superscalar processor
7
Instruction Fetch and Branch Prediction
  • The fetch phase supplies instructions to the rest
    of the processing pipeline.
  • An instruction cache is used to reduce the
    latency and increase the bandwidth of the
    instruction fetch process.
  • The program counter (PC) is used to search the
    cache and determine whether the instruction being
    addressed is present in one of the cache lines.
  • In a superscalar implementation, the fetch phase
    fetches multiple instructions per cycle from
    cache memory.

8
Branch Instructions
  • Recognizing conditional branches
  • - Decode information (extra bits) is held in the
      instruction cache with every instruction
  • Determining the branch outcome
  • - Branch prediction using information about the
      past history of branch outcomes
  • Computing the branch target
  • - Usually an integer addition (PC + offset value)
  • - A Branch Target Buffer holds the target address
      used the last time the branch was executed
  • Transferring control
  • - If the branch is taken → at least one clock
      cycle of delay to recognize the branch, modify
      the PC, and fetch instructions from the target
      address
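The "past history" prediction mentioned above can be illustrated with the classic 2-bit saturating-counter scheme. This is a generic textbook sketch (the table size and PC indexing are illustrative assumptions), not the 21264's actual predictor:

```python
# Sketch: a 2-bit saturating-counter branch predictor.
# Counters 0-1 predict not-taken, 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries    # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)  # drop byte offset, wrap

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]   # a mostly-taken branch
hits = 0
for taken in outcomes:
    hits += (p.predict(0x4000) == taken)
    p.update(0x4000, taken)
# The counter saturates toward taken, so one not-taken outcome
# does not flip the prediction for the following iterations.
```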

9
Instruction Decode
  • Instructions are removed from fetch buffers,
    examined and data dependence linkages are set up.
  • Data dependences
  • True dependences can cause a read after write
    (RAW) hazard
  • Artificial dependences can cause write after
    read (WAR) and write after write (WAW) hazards.
  • Hazards
  • RAW occurs when a consuming instruction reads a
    value before the producing instruction writes it.
  • WAR occurs when an instruction writes a value
    before a preceding instruction reads it.
  • WAW occurs when multiple instructions update the
    same storage location but not in the proper
    order.
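The three hazard classes above can be checked mechanically. A small sketch, assuming each instruction is represented as a destination register plus a tuple of source registers (this representation is an illustration, not a standard format):

```python
# Sketch: classify the data hazards between two instructions
# that appear in program order (earlier before later).

def hazards(earlier, later):
    """Each instruction is (dest_reg, (src_regs...)); dest may be None."""
    found = set()
    if earlier[0] is not None and earlier[0] in later[1]:
        found.add("RAW")            # later reads what earlier writes
    if later[0] is not None and later[0] in earlier[1]:
        found.add("WAR")            # later writes what earlier reads
    if earlier[0] is not None and earlier[0] == later[0]:
        found.add("WAW")            # both write the same register
    return found

# ADD r1, r2, r3  then  SUB r4, r1, r5 : SUB reads r1 -> RAW
print(hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r5"))))  # {'RAW'}
```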

10
Instruction Decode (cont.)
  • Example of data hazards
  • For each instruction, the decode phase sets up
    the operation to be executed, the identities of
    the storage elements where the inputs reside, and
    the locations where the results must be placed.
  • Artificial dependences are eliminated through
    register renaming.

11
Instruction Issue and Parallel Execution
  • Run-time checking for availability of data and
    resources.
  • An instruction is ready to execute as soon as its
    input operands are available. However, there are
    other constraints such as execution units and
    register file ports.
  • An issue queue is responsible for holding the
    instructions until their input operands are
    available.
  • Out-of-order execution: the ability to execute
    instructions not in program order, but as soon as
    their operands are ready.
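The issue-queue behavior described above (hold instructions until their operands are available, then issue the ready ones, oldest first, up to the machine width) can be sketched as follows; the instruction format and same-cycle result availability are simplifying assumptions:

```python
# Sketch: oldest-first, readiness-based issue from a queue.

def simulate_issue(instrs, width=2):
    """instrs: list of (name, srcs, dest) in program order.
    A source is ready if an issued instruction produced it, or if
    no instruction in this window writes it. Returns issue groups."""
    produced = {d for _, _, d in instrs}    # regs written in this window
    ready_regs = set()                      # regs already produced
    pending = list(instrs)
    schedule = []
    while pending:
        issued = [i for i in pending
                  if all(s in ready_regs or s not in produced
                         for s in i[1])][:width]
        if not issued:
            break                           # nothing ready (shouldn't happen here)
        schedule.append([i[0] for i in issued])
        for i in issued:
            ready_regs.add(i[2])
            pending.remove(i)
    return schedule

prog = [("load", (), "r1"),
        ("add",  ("r1",), "r2"),   # depends on the load
        ("mul",  (), "r3")]        # independent: issues ahead of program order
print(simulate_issue(prog))        # [['load', 'mul'], ['add']]
```

Note how `mul` issues before `add` even though it is later in program order: issue is driven by operand availability, as the slide states.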

12
Handling Memory Operations
  • For memory operations, the decode phase cannot
    identify the memory locations that will be
    accessed.
  • The determination of the memory location that
    will be accessed requires an address calculation,
    usually integer addition.
  • Once a valid address is obtained, the load or
    store operation is submitted to memory.

13
Committing State
  • The effects of an instruction are allowed to
    modify the logical process state.
  • The purpose of this phase is to implement the
    appearance of a sequential execution model, even
    though the actual execution is not sequential.
  • Machine state is separated into physical and
    logical. Physical state is updated as the
    operations complete while logical is updated in
    sequential program order.
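The physical/logical split above can be sketched with a toy reorder buffer: results complete out of order into the physical state, but the logical state is updated only from the head, in program order. A minimal illustration, not any real machine's commit logic:

```python
# Sketch: out-of-order completion, in-order commit.

class ReorderBuffer:
    def __init__(self, instr_names):
        self.order = list(instr_names)   # program order
        self.done = {}                   # name -> result (physical state)
        self.logical = []                # committed names (logical state)

    def complete(self, name, result):
        self.done[name] = result         # may happen in any order

    def commit(self):
        # retire from the head only while the head has completed
        while self.order and self.order[0] in self.done:
            self.logical.append(self.order.pop(0))

rob = ReorderBuffer(["i1", "i2", "i3"])
rob.complete("i3", 7)     # i3 finishes first...
rob.commit()
print(rob.logical)        # [] : i1 not done, nothing commits yet
rob.complete("i1", 1)
rob.complete("i2", 2)
rob.commit()
print(rob.logical)        # ['i1', 'i2', 'i3'] : in program order
```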

14
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

15
Alpha 21264 (EV6)
16
Instruction Fetch
  • Fetches 4 instructions per cycle
  • Large 64 KB 2-way associative instruction cache
  • Branch predictor dynamically chooses between
    local and global history
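The dynamic choice between local and global history can be sketched as a tournament chooser: a table of 2-bit counters that learns, per branch, which predictor to trust. This is a simplified illustration (indexed here by PC for brevity; the table size and update policy are assumptions, not the 21264's exact design):

```python
# Sketch: a 2-bit chooser between a local and a global prediction.
# Counters 0-1 favor the local predictor, 2-3 favor the global one.

class TournamentChooser:
    def __init__(self, entries=1024):
        self.counters = [1] * entries

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)

    def choose(self, pc, local_pred, global_pred):
        return global_pred if self.counters[self._index(pc)] >= 2 else local_pred

    def update(self, pc, local_pred, global_pred, taken):
        if local_pred == global_pred:
            return                       # train only on disagreement
        i = self._index(pc)
        if global_pred == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

c = TournamentChooser()
pc = 0x800
# Local says not-taken, global says taken; the branch is taken twice,
# so the chooser learns to trust the global predictor for this branch.
for _ in range(2):
    c.update(pc, False, True, True)
```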

17
Register Renaming
  • Assignment of a unique storage location to each
    write reference to a register.
  • The register allocated becomes part of the
    architectural state only when the instruction
    commits.
  • Elimination of WAW and WAR dependences but
    preservation of RAW dependences necessary for
    correct computation.
  • In addition to the 64 architectural registers (32
    integer, 32 floating-point), 41 integer and 41
    floating-point registers are available to hold
    speculative results prior to instruction
    retirement, within an 80-instruction in-flight
    window.
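The renaming rule above (a fresh physical register for every write, with sources read through the current map) can be sketched with a simple map table. Register names and counts are illustrative, not the 21264's; free-list recycling at retirement is omitted:

```python
# Sketch: map-table register renaming. Every write gets a fresh
# physical register (removing WAW/WAR conflicts); every read goes
# through the map (preserving RAW dependences).

def rename(instrs, num_phys=8):
    """instrs: list of (dest, srcs) over architectural registers.
    Returns the same list over physical registers p0, p1, ..."""
    free = [f"p{i}" for i in range(num_phys)]
    table = {}                       # arch reg -> current physical reg
    out = []
    for dest, srcs in instrs:
        rsrcs = tuple(table.get(s, s) for s in srcs)  # RAW: read the map
        preg = free.pop(0)           # fresh register for every write
        table[dest] = preg
        out.append((preg, rsrcs))
    return out

# Two writes to r1 (a WAW hazard) get distinct physical registers,
# and the reader of r1 is linked to the correct producer.
prog = [("r1", ("r2",)), ("r3", ("r1",)), ("r1", ("r4",))]
renamed = rename(prog)
print(renamed)
```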

18
Issue Queues
  • 20 entry integer queue
  • can issue 4 instructions per cycle
  • 15 entry floating-point queue
  • can issue 2 instructions per cycle
  • Each queue keeps a list of pending instructions
    and, every cycle, selects instructions to issue as
    their input data become ready.
  • Queues issue instructions speculatively and older
    instructions are given priority over newer in the
    queue.
  • An issue queue entry becomes available when the
    instruction issues or is squashed due to
    mis-speculation.

19
Execution Engine
  • All execution units require access to the
    register file.
  • The register file is split into two clusters that
    contain duplicates of the 80-entry register file.
  • Two pipes access a single register file to form a
    cluster and the two clusters are combined to
    support 4-way integer execution.
  • Two floating point execution pipes are organized
    in a single cluster with a single 72-entry
    register file.

20
Memory System
  • Supports in-flight memory references and
    out-of-order operation
  • Receives up to 2 memory operations from the
    integer execution pipes every cycle
  • 64 KB 2-way set associative data cache and direct
    mapped level-two cache (ranges from 1 to 16 MB)
  • 3-cycle latency for integer loads and 4 cycles
    for FP loads

21
Store/Load Memory Ordering
  • Memory system supports capabilities of
    out-of-order execution but maintains an in-order
    architectural memory model.
  • It would be wrong if a later load issued prior to
    an earlier store to the same address.
  • This RAW memory dependence cannot be handled by
    the rename logic, because the memory address is
    not known before the instruction issues.
  • If a load is incorrectly issued before an earlier
    store to the same address, the 21264 trains the
    out-of-order execution core to avoid this on
    subsequent executions of the same load: it sets a
    bit in a load wait table that forces the issue
    point of that load to be delayed until all prior
    stores have issued.
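The load wait table can be sketched as a PC-indexed table of bits; the table size, indexing, and reset policy here are illustrative assumptions:

```python
# Sketch: training a load to wait after an ordering violation.

class LoadWaitTable:
    def __init__(self, entries=1024):
        self.bits = [False] * entries

    def _index(self, pc):
        return (pc >> 2) % len(self.bits)

    def must_wait(self, pc):
        """If set, the load waits until all prior stores have issued."""
        return self.bits[self._index(pc)]

    def train(self, pc):
        """Called when this load issued before an older store
        to the same address."""
        self.bits[self._index(pc)] = True

lwt = LoadWaitTable()
load_pc = 0x1200
# First execution: no history, the load issues early and violates ordering.
print(lwt.must_wait(load_pc))   # False
lwt.train(load_pc)
# Subsequent executions: the load is held back.
print(lwt.must_wait(load_pc))   # True
```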

22
Load Hit/ Miss Prediction
  • To achieve the 3-cycle integer load hit latency,
    it is necessary to speculatively issue consumers
    of integer load data before knowing if the load
    hit or missed in the data cache.
  • If the load eventually misses, two integer cycles
    are squashed and all integer instructions that
    issued during those cycles are pulled back in the
    issue queue to be re-issued later.
  • The 21264 predicts when loads will miss and does
    not speculatively issue the consumers of the load
    in that case. The effective load latency is 5
    cycles for an integer load hit that is incorrectly
    predicted to miss.
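The trade-off above can be summarized in one function. The 3- and 5-cycle figures come from the slides; the miss latency and replay penalty below are illustrative assumptions, not 21264 figures:

```python
# Sketch: cycles until a load's consumers can issue with valid data,
# for each combination of predicted and actual hit/miss.

def consumer_issue_latency(predicted_hit, actual_hit, miss_latency=13):
    """miss_latency is an assumed L2 fill time, for illustration only."""
    if predicted_hit and actual_hit:
        return 3                    # correct hit prediction: 3-cycle load-use
    if predicted_hit and not actual_hit:
        return miss_latency + 2     # two issue cycles squashed and replayed
    if actual_hit:
        return 5                    # hit mispredicted as miss (per the slide)
    return miss_latency             # correctly predicted miss

print(consumer_issue_latency(True, True))    # 3
print(consumer_issue_latency(False, True))   # 5
```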

23
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

24
Sim-alpha simulator
  • Sim-alpha is a simulator that models the Alpha
    21264.
  • It models the implementation constraints and
    low-level features of the 21264.
  • It allows the user to vary different parameters of
    the processor, such as the fetch width, reorder
    buffer size, and issue queue sizes.

25
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

26
Out-Of-Order Execution
  • Why out-of-order execution?
  • - In-order processors stall frequently
  • - The pipeline may not stay full because of the
      frequent stalls
  • Example

In out-of-order processors, an instruction with no
outstanding dependences is moved to execution: the
instructions that are ready are allowed to proceed.
27
Out-Of-Order Execution
  • Theme: stall only on RAW hazards and structural
    hazards
  • - RAW hazard
  • - Example:
      LD  R4, 10(R5)
      ADD R6, R4, R8
  • - Structural hazard
  • - Occurs because of resource conflicts
  • - Example: a CPU designed with a single interface
      to memory
  • - This interface is always used during IF
  • - It is also used in MEM for LOAD or STORE
      operations
  • - When a load or store reaches the MEM stage, IF
      must stall

28
Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation

29
Prediction Speculation
  • Problem: serialization due to dependences
  • - Wait until the producing instruction executes
  • - But we need optimal utilization of resources
  • Solution:
  • - Predict the unknown information
  • - Allow the processor to proceed, assuming the
      predicted information is correct
  • - If the prediction is correct, proceed normally
  • - If not, squash the speculatively executed
      instructions, restore the state, and restart in
      the correct direction

30
Taxonomy Of Speculation
A branch outcome has two possibilities - worst case
50% accuracy. A 32-bit register value has 2^32
possibilities - worst case?
31
Prediction Speculation
  • Control speculation
  • - Current branch predictors are highly accurate
  • - Implemented in commercial processors
  • Data speculation
  • - A lot of research (value profiling, value
      prediction, etc.)
  • - Not implemented in current processors
  • - So far, very little effect on current designs