Superscalar Pipeline Architectures - PowerPoint PPT Presentation

Provided by: mattos4
Transcript and Presenter's Notes

Title: Superscalar Pipeline Architectures


1
Superscalar Pipeline Architectures
  • By Matthew Osborne, Philip Ho, Xun Chen
  • April 19, 2004

2
Superscalar Architecture
  • Relatively new: superscalar microprocessors first
    appeared in the early 1990s
  • Builds on the concept of pipelining
  • Superscalar architectures can process multiple
    instructions in one clock cycle by providing
    multiple instruction execution units
  • Allows the instruction execution rate to exceed
    the clock rate (CPI of less than 1)
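
As a minimal sketch of the CPI claim above: CPI is cycles divided by instructions, and IPC is its inverse. The function names and the cycle and instruction counts are made up for illustration and do not come from any processor discussed in the slides.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Average clock cycles per instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Average instructions completed per clock cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Hypothetical run: 1000 instructions complete in 600 cycles.
run_cpi = cpi(600, 1000)  # 0.6, below 1, so more than one instruction per cycle
run_ipc = ipc(600, 1000)  # about 1.67
```

A scalar pipeline can at best reach a CPI of 1, so any CPI below 1 implies that several instructions completed in the same cycle.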

3
Overview of Selected Superscalar Architectures
  • Intel
  • MIPS
  • PowerPC
  • T1000 Architectures
  • Hobbes: A Multi-threaded Superscalar Architecture

4
Intel Superscalar Architecture
  • According to Sara Sarimento, in her essay "Recent
    History of Intel Architecture: A Refresher":
  • - Intel's first use of a superscalar
    architecture was its Pentium processor
  • - Instruction-level parallelism: instructions
    that do not depend on one another's results
    execute concurrently, which uses more of the
    available hardware resources and increases
    instruction throughput.

5
Intel P5 Microarchitecture
  • Used in the initial Pentium processor
  • Could execute up to 2 instructions simultaneously
  • Instructions were sent through the pipeline in
    order. If the next two instructions had a
    dependency, only one of them was issued and the
    second execution unit (pipe) went unused for
    that clock cycle.
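
The in-order pairing rule described above can be sketched as follows. This is a simplified model, not the actual Pentium pairing rules: it assumes every instruction takes one cycle, and it checks only whether the second instruction reads or overwrites the first one's destination register. The `Instr` tuple format and the register names are hypothetical.

```python
from typing import List, Optional, Tuple

# (destination register or None, source registers); a made-up format
Instr = Tuple[Optional[str], Tuple[str, ...]]


def depends(first: Instr, second: Instr) -> bool:
    """True if `second` reads or overwrites the destination of `first`."""
    dest1, _ = first
    dest2, srcs2 = second
    if dest1 is None:
        return False
    return dest1 in srcs2 or dest1 == dest2


def dual_issue_cycles(program: List[Instr]) -> int:
    """Count issue cycles for an in-order machine with two pipes."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and not depends(program[i], program[i + 1]):
            i += 2  # both pipes used this cycle
        else:
            i += 1  # second pipe idles this cycle
        cycles += 1
    return cycles


independent = [("r1", ("r2", "r3")), ("r4", ("r5", "r6")),
               ("r7", ("r8",)), ("r9", ("r10",))]
chained = [("r1", ("r2", "r3")), ("r4", ("r1", "r6")),
           ("r7", ("r4",)), ("r9", ("r7",))]

print(dual_issue_cycles(independent))  # 2
print(dual_issue_cycles(chained))      # 4
```

With four independent instructions both pipes stay busy and the sequence issues in two cycles. With a dependency chain the second pipe is idle every cycle and the same count of instructions takes four.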

6
Intel P6 Microarchitecture
  • Used in the Pentium Pro, Pentium II and Pentium
    III processors
  • 3 instruction decoders, which break each CISC
    instruction (macro-op) into equivalent
    micro-operations (µops) for the Out-of-Order
    Execution unit
  • 10-stage instruction pipeline

7
Intel P6 Microarchitecture
  • Out-of-order instruction execution: instructions
    without unresolved data dependencies execute
    out of program order for a higher level of
    hardware utilization
  • Scheduler unit resolves data dependency issues
    between individual instructions
  • Re-Order Buffer puts instructions back in
    program order before their results are written
    back (retired)
  • Up to 3 instructions can be retired per clock
    cycle
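
In-order retirement through a re-order buffer can be sketched as below. This is an illustrative model only, not Intel's implementation: the class name, method names and entry format are assumptions, and the default retire width of 3 is taken from the slide.

```python
from collections import deque


class ReorderBuffer:
    """Toy re-order buffer: out-of-order completion, in-order retirement."""

    def __init__(self, retire_width: int = 3):
        self.retire_width = retire_width
        self.entries = deque()  # program order, oldest entry on the left

    def dispatch(self, name: str) -> dict:
        """Allocate an entry in program order."""
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry: dict) -> None:
        """Mark an entry as executed; this may happen in any order."""
        entry["done"] = True

    def retire(self) -> list:
        """Retire finished entries from the head, up to retire_width per call."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < self.retire_width):
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
a, b, c, d = (rob.dispatch(n) for n in "abcd")
rob.complete(c)
rob.complete(b)
first = rob.retire()   # [] because the oldest instruction, a, is not done
rob.complete(a)
second = rob.retire()  # ['a', 'b', 'c']
rob.complete(d)
third = rob.retire()   # ['d']
```

Instructions `b` and `c` finish early but cannot retire until `a` does, which is what keeps the visible results in program order.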

8
Intel NetBurst MicroArchitecture
  • New architecture used for the Intel Pentium 4
    and Xeon processors

9
Intel NetBurst Microarchitecture
  • Changes from the P6 architecture:
  • Only one instruction decoder present
  • Decoder moved outside the Out-of-Order Execution
    Unit; an Execution Trace Cache was added in its
    place
  • Number of pipeline stages increased to 20
  • Improved branch prediction algorithms
  • Integer ALUs run at twice the core clock
    frequency, so they operate twice as quickly as
    ALUs clocked at the core frequency, as in P6

10
Intel NetBurst Microarchitecture
  • Execution Trace Cache:
  • Alleviates delays in fetching CISC instructions
    and translating them to their µops
  • Instructions are now decoded by a translation
    engine, with the resulting µops stored as traces
    (sequences of µops) in the Execution Trace Cache
  • Traces are stored along the predicted path of
    program execution, with the predicted outcomes
    of branches in the code integrated into this
    path
  • Delivers up to 3 µops to the core of the
    Execution Unit per clock cycle

11
Intel NetBurst Microarchitecture
  • Branch Prediction:
  • Branch targets are predicted based on their
    linear address using branch prediction logic and
    fetched as soon as possible
  • Targets are fetched from the Execution Trace
    Cache if cached there; otherwise they are
    fetched from the memory hierarchy
  • Downside: despite the improved prediction
    algorithm, mispredicted branches are one of the
    biggest costs of this architecture, because the
    instruction pipeline is longer than in previous
    architectures.
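
A rough way to see why a longer pipeline makes mispredictions more expensive is the usual stall-cycle estimate: effective CPI = base CPI + (branch fraction × misprediction rate × penalty in cycles). The sketch below applies it with made-up numbers. The branch fraction, misprediction rate and penalties of 10 and 20 cycles are assumptions for illustration (a penalty that grows with pipeline depth), not measured values for P6 or NetBurst.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Base CPI plus the average stall cycles added by mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 5% of them mispredicted.
shorter_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=10)
longer_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)
```

With the same predictor accuracy, doubling the penalty doubles the cycles lost to mispredictions (about 0.1 versus 0.2 cycles per instruction here), which is why the deeper pipeline needs a better predictor just to break even.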

12
MIPS Superscalar Architecture
  • MIPS is a RISC instruction platform, whereas
    Intel's is a CISC instruction platform (RISC
    made the design of a superscalar architecture
    easier than it was for Intel's CISC platform)
  • The first MIPS processor with a superscalar
    architecture was the 64-bit MIPS R8000, released
    in 1994.

13
MIPS R8000 Processor
  • R8000 Chip Set Diagram
  • Courtesy of Silicon Graphics:
    http://sgi.cartsys.net/i2sec7.html

14
MIPS R8000 Features
  • Superscalar
  • Can issue up to 4 instructions per cycle, in
    program order
  • Multi-component chip set (Integer Unit, Floating
    Point Unit, Tag RAMs and Data Streaming Cache)
  • Designed for peak floating-point performance

15
MIPS R8000 Pitfalls
  • Limited integer performance
  • Very high cost
  • As a result of these two key factors:
  • The R8000 was only in the marketplace for about a
    year.
  • The processor was used mainly in the scientific
    community

16
MIPS R10000 Processor
Superscalar Pipeline Architecture for the R10000
processor. Diagram courtesy of the R10000
Microprocessor User's Manual:
http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/t5.Ver.2.0.book_12.html
17
R10000 Processor - Features
  • Introduced in 1995
  • Improved integer instruction performance
  • Ability to create a multi-processor system (up
    to 4 R10000 chips can be connected together)
  • Fetches and decodes 4 instructions each clock
    cycle
  • Out-of-order instruction execution: the first
    MIPS processor to support this feature

18
R10000 Block Diagram
  • Each decoded instruction is sent to one of 3
    instruction queues:
  • - Address Queue (load/store instructions)
  • - Integer Queue (integer ALU operations)
  • - Floating Point Queue (floating-point
    arithmetic operations)
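
The routing of decoded instructions into the three queues can be sketched as below. This is a simplified illustration, not the R10000's decode logic: the mnemonic sets cover only a handful of MIPS instructions, and the function and queue names are hypothetical.

```python
# Small, illustrative subsets of MIPS mnemonics for each queue.
ADDRESS_OPS = {"lw", "sw", "ld", "sd"}
INTEGER_OPS = {"add", "sub", "and", "or"}
FP_OPS = {"add.s", "add.d", "mul.s", "mul.d"}


def dispatch(mnemonics):
    """Route each decoded instruction to one of three queues, keeping order."""
    queues = {"address": [], "integer": [], "floating_point": []}
    for mnemonic in mnemonics:
        if mnemonic in ADDRESS_OPS:
            queues["address"].append(mnemonic)
        elif mnemonic in INTEGER_OPS:
            queues["integer"].append(mnemonic)
        elif mnemonic in FP_OPS:
            queues["floating_point"].append(mnemonic)
        else:
            raise ValueError(f"unclassified instruction: {mnemonic}")
    return queues


queues = dispatch(["lw", "add", "mul.d", "sw", "sub"])
# {'address': ['lw', 'sw'], 'integer': ['add', 'sub'],
#  'floating_point': ['mul.d']}
```

Separating the queues by instruction class lets each class wait for its own operands without blocking the others.
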
19
MIPS R10000 Processor
  • 5 execution pipelines:
  • - Load/Store Unit
  • - Two Integer ALUs
  • - Floating Point Adder
  • - Floating Point Multiplier
  • Can process up to 4 instructions simultaneously,
    out of program order
  • Base core design from which successor MIPS
    processors have been built

20
PowerPC
  • Direct descendant of the IBM 801, RT PC and
    RS/6000
  • All are RISC designs
  • The RS/6000 was the first superscalar design in
    this line
  • The PowerPC 601 superscalar design is similar to
    that of the RS/6000
  • Later versions extend the superscalar concept

21
PowerPC 601 Pipeline Structure
22
PowerPC 601 Pipeline
23
PowerPC 601 General View
24
PowerPC storage model
  • Supports byte (8-bit), halfword (16-bit),
    word (32-bit) and doubleword (64-bit) data types.
  • Handles string operations for multi-byte strings
    up to 128 bytes
  • 32-bit PowerPC implementations support a 4-GB
    effective address space.
  • 64-bit PowerPC implementations support a
    16-exabyte effective address space.
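
The two address-space figures follow directly from the address width, taking GB and exabyte in their binary senses (2^30 and 2^60 bytes). A quick check, with hypothetical helper names:

```python
GIB = 2 ** 30  # bytes in one gigabyte (binary)
EIB = 2 ** 60  # bytes in one exabyte (binary)


def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


space_32 = address_space_bytes(32) // GIB  # 4 (GB)
space_64 = address_space_bytes(64) // EIB  # 16 (exabytes)
```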

25
General-purpose registers (GPR)
  • The User Instruction Set Architecture specifies
    that all implementations have 32 GPRs
  • GPRs are the source and destination of all
    integer operations
  • GPR0's contents are not looked up when it is
    named as the base register of an address
    computation; the value 0 is used instead.
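
The GPR0 special case shows up in PowerPC effective-address calculation: when register 0 is named as the base register of a load or store, the value 0 is used instead of the register's contents. The sketch below models a 32-bit base-plus-displacement address. The function name and register values are hypothetical, and the displacement is assumed to be already sign-extended.

```python
def effective_address(gprs, ra: int, displacement: int) -> int:
    """Base-plus-displacement address; a base register number of 0 means 0."""
    base = 0 if ra == 0 else gprs[ra]
    return (base + displacement) & 0xFFFFFFFF  # wrap to 32 bits


gprs = [0] * 32
gprs[0] = 0x1000  # ignored when register 0 is used as the base
gprs[3] = 0x2000

ea_r0 = effective_address(gprs, 0, 8)  # 0x8: register 0 is not read
ea_r3 = effective_address(gprs, 3, 8)  # 0x2008
```

This lets a load or store use an absolute address without first clearing a register.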

26
Floating-point registers (FPR)
  • All implementations have 32 FPRs.
  • FPRs are the source and destination operands of
    all floating-point operations.
  • FPRs can contain 32-bit and 64-bit signed and
    unsigned integer values, as well as
    single-precision and double-precision
    floating-point values.

27
Special-purpose registers (SPR)
  • Give status and control of resources within the
    processor core.
  • SPRs that applications can read and write
    without support from a system service include
    the Count Register, the Link Register and the
    Integer Exception Register.
  • SPRs that applications can only read with
    support from a system service include the Time
    Base and other timers.

28
T1000 Architectures
  • The T1000 architectures are reconfigurable
    computing architectures embedded into a
    superscalar processor
  • T1000 architectures rely on a programmable
    functional unit (PFU) integrated into the
    datapath.
  • T1000 is assumed to be a 4-issue out-of-order
    machine. Out-of-order execution helps tolerate
    the latencies of some data-dependent instruction
    sequences.
  • A T1000 extended instruction is encoded as a
    register-register operation with a specific
    opcode.

29
Hobbes
  • Multi-threaded architectures attempt to increase
    pipeline utilization by concurrently executing
    instructions from different threads.
  • The architecture chosen was an aggressive,
    speculative, out-of-order superscalar processor
    based on the MIPS R2000 instruction set.
  • The Hobbes architecture combines multi-threading
    with superscalar issue, with the supposition that
    the strengths of one should offset the
    weaknesses of the other.
  • By supporting superscalar issue from more than
    one thread, the architecture overcomes the lack
    of instruction-level parallelism that plagues
    other superscalar structures.

30
Background
  • The Hobbes micro-architecture draws its
    inspiration from two widely differing
    architectures: multi-threaded and superscalar.
  • It is hoped that combining the fundamental
    concepts of these architectures will build upon
    their respective strengths and compensate for
    their corresponding weaknesses, allowing a hybrid
    to be greater than the sum of its parts.

31
Multi-threaded Architectures
  • Multi-threaded processors can concurrently
    execute instructions from more than one thread.
  • The contexts of multiple threads are stored
    on-board, which allows instructions to be issued
    from different threads.
  • Traditional multi-threaded architectures have
    usually implemented a round-robin execution
    strategy, which switches instruction execution
    to a new thread every cycle.
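
The round-robin strategy can be sketched as below: one thread is selected per cycle in fixed rotation, and a cycle is lost if the selected thread has nothing left to issue. This is a toy model; the thread names, instruction labels and function name are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def round_robin_issue(threads: Dict[str, List[str]],
                      cycles: int) -> List[Tuple[str, Optional[str]]]:
    """Issue one instruction per cycle, rotating through threads in order."""
    names = list(threads)
    next_index = {name: 0 for name in names}
    trace = []
    for cycle in range(cycles):
        name = names[cycle % len(names)]  # switch thread every cycle
        pending = threads[name]
        if next_index[name] < len(pending):
            trace.append((name, pending[next_index[name]]))
            next_index[name] += 1
        else:
            trace.append((name, None))  # selected thread is empty: bubble
    return trace


trace = round_robin_issue({"T0": ["a0", "a1"], "T1": ["b0"]}, cycles=4)
# [('T0', 'a0'), ('T1', 'b0'), ('T0', 'a1'), ('T1', None)]
```

The last cycle is wasted because strict rotation selects a thread with no work.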

32
The Thread Unit of Hobbes
  • The Thread unit contains all of the elements
    required to support a single thread.
  • It consists of a fetch buffer, issue buffer,
    decode logic, branch adder and the thread state
    storage.

33
The Thread Unit
  • Instruction fetch is performed by reading an
    entire cache line of four words and storing it
    in the fetch buffer.
  • Each thread decodes and issues its instructions
    in program order. After an instruction has been
    decoded, it is stalled until all of its operands
    are available.
  • Once the operands are ready, the instruction is
    placed into the issue buffer and the issue unit
    is notified.
  • The register file is very similar to that found
    on the R2000. The register file has two write
    ports and both of these may be from the same
    thread.
  • Branches which do not affect the register file
    are executed in the thread unit and are not
    issued to the execution unit.
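
The decode-stall-issue sequence within one thread unit can be sketched as follows. This is a simplified model, not the Hobbes design itself: it handles one instruction per step, ignores the four-word cache-line fetch and the branch handling, and uses hypothetical class, method and register names.

```python
from collections import deque


class ThreadUnit:
    """Toy thread unit: in-order decode, stall until operands are ready."""

    def __init__(self, instructions):
        # Each instruction is (name, source registers), in program order.
        self.fetch_buffer = deque(instructions)
        self.issue_buffer = []

    def step(self, ready_registers):
        """Move the oldest instruction to the issue buffer if it can proceed.

        Returns the instruction name, or None if the thread stalled or is
        empty. Younger instructions never bypass a stalled older one.
        """
        if not self.fetch_buffer:
            return None
        name, sources = self.fetch_buffer[0]
        if all(src in ready_registers for src in sources):
            self.fetch_buffer.popleft()
            self.issue_buffer.append(name)
            return name
        return None


unit = ThreadUnit([("i0", ("r1",)), ("i1", ("r2",))])
stalled = unit.step({"r2"})          # None: i0 waits for r1, i1 cannot pass it
issued0 = unit.step({"r1", "r2"})    # 'i0'
issued1 = unit.step({"r1", "r2"})    # 'i1'
```

Because each thread stalls in order, the parallelism in this scheme comes from other threads issuing while one thread waits.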

34
The Execution Units of Hobbes
  • The Hobbes architecture has a set of execution
    units almost identical to that of an
    out-of-order superscalar processor.
  • The characteristics of the execution units
    approximately correspond to those of the
    R2000/R2010.
  • Execution units:
  • Integer: 2 ALUs, Shifter, Multiply / Divide,
    Load / Store, Data cache interface
  • FP: FP Convert, FP Add, FP Multiply, FP Divide
35
Superscalar Architecture
  • Superscalar processors improve performance by
    reducing the average number of cycles required to
    execute each instruction
  • This is accomplished by issuing and executing
    more than one independent instruction per cycle,
    rather than limiting execution to just one
    instruction per cycle as traditional pipelined
    architectures do.
  • For superscalar architectures to achieve a
    speed-up over traditional pipelined
    architectures, the average level of available
    instruction-level parallelism must be greater
    than one.
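
One common way to estimate the available instruction-level parallelism of a code sequence is to divide the instruction count by the length of its longest dependency chain. The sketch below does that under idealized assumptions that are not from the slides: unit latency for every instruction, unlimited issue width, and dependencies given as lists of earlier instruction indices. The function name is hypothetical.

```python
def available_ilp(dependencies):
    """Instruction count divided by the longest dependency chain length.

    dependencies[i] lists the indices (all smaller than i) of the
    instructions whose results instruction i needs.
    """
    if not dependencies:
        raise ValueError("need at least one instruction")
    depth = []
    for preds in dependencies:
        # An instruction can execute one step after its deepest predecessor.
        depth.append(1 + max((depth[p] for p in preds), default=0))
    return len(dependencies) / max(depth)


chain = available_ilp([[], [0], [1], [2]])     # 1.0: no speed-up possible
independent = available_ilp([[], [], [], []])  # 4.0
mixed = available_ilp([[], [], [0, 1], []])    # 2.0
```

A fully dependent chain has an ILP of exactly one, so extra execution units cannot help it. Only sequences with a value above one can benefit from superscalar issue.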

36
References
  • Hennessy, John L. and Patterson, David A.
    Computer Organization and Design: The
    Hardware/Software Interface. San Francisco:
    Morgan Kaufmann Publishers, 1998.
  • Sarimento, Sara. Recent History of Intel
    Architecture: A Refresher. 17 April 2004.
    Intel Corporation, www.intel.com. 18 April 2004.
    http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/44015.htm
  • Zhou & Martonosi. Augmenting Modern Superscalar
    Architectures with Configurable Extended
    Instructions. 19 April 2004.
    http://ipdps.eece.unm.edu/2000/raw/18000943.pdf
  • Kish & Preiss. Hobbes: A Multi-Threaded
    Superscalar Architecture. 19 April 2004.
    http://www.brpreiss.com/page75.html
  • R10000 Processor User's Manual. 9 Dec 1996. SGI
    Corporation. 22 April 2004.
    http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/index.html#HEADING1
  • MIPS Architecture. 17 April 2004. Wikipedia,
    The Free Encyclopedia,
    http://en.wikipedia.org/wiki/Main_Page.
    23 April 2004.
    http://en.wikipedia.org/wiki/MIPS_architecture
  • Mapleson, Ian. Indigo 2 and Power Indigo 2
    Technical Report. Silicon Graphics. 23 April
    2004. http://sgi.cartsys.net/i2sec7.html
  • PowerPC Architecture. 23 April 2004.
    http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html