Title: Superscalar Pipeline Architectures
1Superscalar Pipeline Architectures
- By Matthew Osborne, Philip Ho, Xun Chen
- April 19, 2004
2Superscalar Architecture
- Relatively new, first appeared in early 1990s
- Builds on the concept of pipelining
- Superscalar architectures can process multiple instructions in one clock cycle (multiple instruction execution units)
- Allows the instruction execution rate to exceed the clock rate (CPI of less than 1)
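The CPI claim above can be checked with simple arithmetic. The following is a minimal sketch; the cycle and instruction counts are made-up numbers chosen for illustration, not measurements from any processor.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Average clock cycles per instruction."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Average instructions completed per clock cycle (the inverse of CPI)."""
    return instructions / cycles


# Hypothetical run: 1000 instructions complete in 400 cycles.
print(cpi(400, 1000))  # 0.4, below 1, so more than one instruction per cycle
print(ipc(400, 1000))  # 2.5
```

A scalar pipeline can at best reach a CPI of 1; a CPI below 1 is only possible when more than one instruction completes per cycle.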
3Overview of Selected Superscalar Architectures
- Intel
- MIPS
- PowerPC
- T1000 Architectures
- Hobbes: A Multi-threaded Superscalar
4Intel Superscalar Architecture
- According to Sara Sarimento, in her essay "Recent History of Intel Architecture: A Refresher":
- - Intel's first use of a superscalar architecture was in its Pentium processor
- - Instruction-level parallelism: instructions that do not depend on one another's outcome execute concurrently to utilize more of the available hardware resources and increase instruction throughput.
5Intel P5 Microarchitecture
- Used in initial Pentium processor
- Could execute up to 2 instructions simultaneously
- Instructions are sent through the pipeline in order: if the next two instructions had a dependency issue, only one instruction (pipe) would be executed and the second execution unit (pipe) went unused for that clock cycle.
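The in-order pairing rule above can be sketched as a small simulation. This is a simplified model, not the actual P5 pairing logic (which has further restrictions): the instruction encoding, the register names and the `depends` check are assumptions made for illustration.

```python
from typing import List, Tuple

# An instruction is (destination register, source registers); names are illustrative.
Instr = Tuple[str, Tuple[str, ...]]


def depends(first: Instr, second: Instr) -> bool:
    """True if `second` reads or overwrites the register that `first` writes."""
    dest, _ = first
    dest2, srcs2 = second
    return dest in srcs2 or dest == dest2


def dual_issue_cycles(program: List[Instr]) -> int:
    """Count issue cycles for an in-order machine with two pipes."""
    cycles = 0
    i = 0
    while i < len(program):
        # Pair the next two instructions only if they are independent.
        if i + 1 < len(program) and not depends(program[i], program[i + 1]):
            i += 2
        else:
            i += 1  # the second pipe goes unused this cycle
        cycles += 1
    return cycles


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),  # needs r1 from the previous instruction
    ("r6", ("r7", "r8")),
    ("r9", ("r2", "r3")),
]
print(dual_issue_cycles(program))  # 3
```

Four instructions take three cycles here instead of two, because the dependency between the first two forces the first instruction to issue alone.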
6Intel P6 Microarchitecture
- Used in the Pentium Pro, Pentium II and Pentium III processors
- 3 instruction decoders, which break each CISC instruction (macro-op) into equivalent micro-operations (µops) for the Out-of-Order Execution unit
- 10-stage instruction pipeline utilized in this architecture
7Intel P6 Microarchitecture
- Out-of-order instruction execution: instructions without data dependency issues are executed out of order for a higher level of hardware utilization
- Scheduler unit resolves data dependency issues between individual instructions
- Re-Order Buffer puts instructions back in program order before their results are written back
- Up to 3 instructions can be retired concurrently
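The role of the Re-Order Buffer can be illustrated with a toy model. This is a sketch under stated assumptions, not Intel's design: the instruction names and finish cycles are invented, and an instruction is allowed to retire in the same cycle it finishes.

```python
from collections import deque

RETIRE_WIDTH = 3  # up to 3 instructions retired per cycle, per the slide


def retire_order(program, finish_cycle):
    """Return {cycle: [instructions retired]} for a toy reorder buffer.

    program      -- instruction names in program order
    finish_cycle -- cycle in which each instruction finishes executing
    """
    rob = deque(program)  # oldest instruction at the left
    retired = {}
    cycle = 0
    while rob:
        cycle += 1
        batch = []
        # Only the oldest entries may retire, and only once they have finished.
        while rob and len(batch) < RETIRE_WIDTH and finish_cycle[rob[0]] <= cycle:
            batch.append(rob.popleft())
        if batch:
            retired[cycle] = batch
    return retired


program = ["i0", "i1", "i2", "i3", "i4"]
# i1 and i2 finish before i0 (out-of-order execution).
finish = {"i0": 3, "i1": 1, "i2": 2, "i3": 3, "i4": 4}
print(retire_order(program, finish))
# {3: ['i0', 'i1', 'i2'], 4: ['i3', 'i4']}
```

Although i1 and i2 finish early, nothing retires until i0 is done, so results become visible in program order.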
8Intel NetBurst MicroArchitecture
- New architecture used for the Intel Pentium 4 and Xeon processors
9Intel NetBurst Microarchitecture
- Changes from P6 Architecture
- Only one instruction decoder present
- Decoder moved outside the Out-of-Order Execution Unit; an Execution Trace Cache was added in its place
- Increased number of pipeline stages to 20
- Improved branch prediction algorithms
- ALUs operate twice as quickly as their P6
counterparts
10Intel NetBurst Microarchitecture
- Execution Trace Cache
- Alleviates delays in fetching and translating CISC instructions to their appropriate µops
- Instructions are now decoded by a translation engine, with the resulting µops stored as traces (sequences of µops) in the Execution Trace Cache.
- Traces are stored in the path of predicted program execution flow, with results of branches in the code integrated into this path
- Delivers up to 3 µops to the core of the Execution Unit per clock cycle
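The idea of caching decoded µops so that decoding is skipped on later passes can be sketched as follows. This is a toy model only: the `TraceCache` class, the `toy_decode` function and the µop names are hypothetical, and real trace construction along predicted branch paths is not modeled.

```python
UOPS_PER_CYCLE = 3  # delivery rate stated on the slide


class TraceCache:
    """Toy trace cache: maps a start address to an already-decoded uop sequence."""

    def __init__(self, decode):
        self.decode = decode  # hypothetical decoder: address -> list of uops
        self.traces = {}
        self.misses = 0

    def fetch(self, address):
        if address not in self.traces:
            self.misses += 1  # decode only on a miss
            self.traces[address] = self.decode(address)
        return self.traces[address]

    def deliver(self, address):
        """Yield the trace in groups of at most UOPS_PER_CYCLE uops."""
        trace = self.fetch(address)
        for i in range(0, len(trace), UOPS_PER_CYCLE):
            yield trace[i:i + UOPS_PER_CYCLE]


def toy_decode(address):
    # Stand-in for the translation engine; the uop names are made up.
    return [f"uop{address:#x}_{n}" for n in range(5)]


cache = TraceCache(toy_decode)
groups = list(cache.deliver(0x40))
list(cache.deliver(0x40))  # second pass hits the cache, so no decode happens
print([len(g) for g in groups], cache.misses)  # [3, 2] 1
```

The second delivery of the same trace does not call the decoder again, which is the delay the trace cache is meant to remove.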
11Intel NetBurst Microarchitecture
- Branch Prediction
- Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible
- Targets are fetched from the Execution Trace Cache if cached there; otherwise they are fetched from the memory hierarchy
- Downside: despite the improved prediction algorithm, one of the biggest costs of this architecture is mispredicted branches, because the instruction pipeline is longer than in previous architectures.
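Why a longer pipeline makes mispredictions more expensive can be shown with a common back-of-the-envelope model. This is a sketch under assumptions: the flush penalty is taken to be roughly the pipeline depth, and the base CPI, branch fraction and misprediction rate are illustrative numbers, not measured values for P6 or NetBurst.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average misprediction stall per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% branches, 5% of them mispredicted.
for depth in (10, 20):
    # Assumption: a misprediction costs about one pipeline refill.
    print(depth, effective_cpi(1.0, 0.20, 0.05, penalty_cycles=depth))
```

With the same predictor accuracy, doubling the pipeline depth doubles the stall term, which is why the deeper design depends on better branch prediction.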
12MIPS Superscalar Architecture
- MIPS is a RISC instruction platform, versus Intel's CISC instruction platform (this made the design of a superscalar architecture easier than for Intel's CISC platform)
- The first MIPS processor with a superscalar architecture was the 64-bit MIPS R8000, released in 1994.
13MIPS R8000 Processor
- R8000 Chip Set Diagram
- Courtesy of Silicon Graphics, http://sgi.cartsys.net/i2sec7.html
14MIPS R8000 Features
- Superscalar
- Can support/process 4 in-order instructions each cycle
- Multi-component chip set (Integer Unit, Floating Point Unit, Tag RAMs and Data Streaming Cache)
- Designed for peak performance with floating-point operations
15MIPS R8000 Pitfalls
- Integer operation performance limited
- Very high cost
- As a result of these two key factors:
- - The R8000 was only in the marketplace for about a year.
- - This processor was mainly used only in the scientific community
16MIPS R10000 Processor
Superscalar Pipeline Architecture for the R10000 processor. Diagram courtesy of the R10000 Microprocessor User's Manual. http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/t5.Ver.2.0.book_12.html
17R10000 Processor - Features
- Introduced in 1995
- Improved integer instruction performance
- Ability to create a multi-processor system (can attach up to 4 R10000 chips together)
- Fetches and decodes 4 instructions each clock cycle/pipeline stage
- Out-of-order instruction execution: first MIPS processor to support this feature
18 R10000 Block Diagram
Each decoded instruction is sent to one of 3 instruction queues:
- Address Queue (Load/Store Instructions)
- Integer Queue (Integer ALU Operations)
- Floating Point Queue (Floating Point Arithmetic Operations)
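The routing of decoded instructions to the three queues can be sketched as below. This is a minimal illustration, not the R10000's actual dispatch logic: the `OPCODE_CLASS` table covers only a handful of MIPS mnemonics chosen as examples, and the `dispatch` function name is hypothetical.

```python
from collections import deque

# Illustrative mapping from a few MIPS mnemonics to a queue name.
OPCODE_CLASS = {
    "lw": "address", "sw": "address",
    "add": "integer", "sub": "integer", "and": "integer",
    "add.s": "floating_point", "mul.d": "floating_point",
}


def dispatch(decoded):
    """Place each decoded instruction into the queue for its class."""
    queues = {"address": deque(), "integer": deque(), "floating_point": deque()}
    for instr in decoded:
        opcode = instr.split()[0]
        queues[OPCODE_CLASS[opcode]].append(instr)
    return queues


# One group of 4 decoded instructions, matching the 4-wide decode on the slide.
queues = dispatch([
    "lw $t0, 0($sp)",
    "add $t1, $t0, $t2",
    "mul.d $f2, $f4, $f6",
    "sw $t1, 4($sp)",
])
for name, queue in queues.items():
    print(name, list(queue))
```

Separating the queues by instruction class lets each class wait for its own operands without blocking the others.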
19MIPS R10000 Processor
- 5 Execution Pipelines
- - Load/Store Unit
- - Two Integer ALUs
- - Floating Point Adder
- - Floating Point Multiplier
- Can process up to 4 out-of-order instructions simultaneously
- Base core design from which successor MIPS processors such as the R12000 were built
20PowerPC
- Direct descendent of IBM 801, RT PC and RS/6000
- All are RISC
- RS/6000 first superscalar
- PowerPC 601 superscalar design similar to RS/6000
- Later versions extend superscalar concept
21PowerPC 601 Pipeline Structure
22PowerPC 601 Pipeline
23PowerPC 601 General View
24PowerPC storage model
- Supports byte (8-bit), halfword (16-bit), word (32-bit) and doubleword (64-bit) data types.
- Handles string operations for multi-byte strings up to 128 bytes
- 32-bit PowerPC implementations support a 4-GB effective address space.
- 64-bit PowerPC implementations support a 16-exabyte effective address space.
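The address-space figures above follow directly from the address width. A short check, assuming binary units (1 GB taken as 2^30 bytes and 1 exabyte as 2^60 bytes):

```python
GIB = 2 ** 30  # bytes in a gigabyte (binary)
EIB = 2 ** 60  # bytes in an exabyte (binary)


def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


print(address_space_bytes(32) // GIB)  # 4
print(address_space_bytes(64) // EIB)  # 16
```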
25General-purpose registers (GPR)
- The User Instruction Set Architecture specifies that all implementations have 32 GPRs
- GPRs are the source and destination of all integer operations
- No lookup is done for GPR0's contents: in address calculations, naming GPR0 as the base register means the value zero.
26Floating-point registers (FPR)
- All implementations have 32 FPRs.
- FPRs are the source and destination operands of all floating-point operations.
- Contain 32-bit and 64-bit signed and unsigned integer values, and single-precision and double-precision floating-point values.
27Special-purpose registers (SPR)
- Give status and control of resources within the processor core.
- SPRs that can be read and written by applications without support from a system service include the Count Register, the Link Register and the Integer Exception Register.
- SPRs that can only be read by applications with support from a system service include the Time Base and other timers.
28T1000 Architectures
- The T1000 architectures are reconfigurable computing architectures embedded into a superscalar processor
- T1000 architectures rely on the programmable functional unit (PFU), integrated into the datapath.
- T1000 is assumed to be a 4-issue out-of-order machine, which helps tolerate the latencies of some data-dependent instruction sequences.
- A T1000 extended instruction is encoded as a register-register operation with a specific opcode.
29Hobbes
- A multi-threaded architecture that attempts to increase pipeline utilization by concurrently executing instructions from different threads.
- The architecture chosen was an aggressive speculative, out-of-order superscalar processor based on the MIPS R2000 instruction set.
- The Hobbes architecture combines multi-threading with superscalar issue, with the supposition that the strengths of one should offset the weaknesses of the other.
- By supporting superscalar issue from more than one thread, the architecture aims to overcome the lack of instruction-level parallelism that plagues other superscalar structures.
30Background
- The Hobbes micro-architecture draws its inspiration from two widely differing architectures: multi-threaded and superscalar.
- It is hoped that combining the fundamental concepts of these architectures will build upon their respective strengths and compensate for their corresponding weaknesses, allowing the hybrid to be greater than the sum of its parts.
31Multi-threaded Architectures
- Multi-threaded processors can concurrently execute instructions from more than one thread.
- The contexts of multiple threads are stored on-board, which allows instructions to be issued from different threads.
- Traditional multi-threaded architectures have usually implemented a round-robin execution strategy that switches instruction execution to a new thread every cycle.
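The round-robin strategy described above can be sketched in a few lines. This is a toy model of the traditional scheme, not of Hobbes itself: the thread names and instruction labels are invented, and an exhausted thread simply leaves its slot empty.

```python
from itertools import cycle


def round_robin_issue(threads, cycles):
    """Issue one instruction per cycle, switching to the next thread every cycle.

    threads -- dict mapping a thread name to its instructions in program order
    cycles  -- number of clock cycles to simulate
    """
    streams = {name: iter(instrs) for name, instrs in threads.items()}
    order = cycle(threads)  # T0, T1, ..., then back to T0
    issued = []
    for _ in range(cycles):
        name = next(order)
        instr = next(streams[name], None)  # None models an empty issue slot
        issued.append((name, instr))
    return issued


threads = {"T0": ["a0", "a1"], "T1": ["b0", "b1"]}
print(round_robin_issue(threads, 4))
# [('T0', 'a0'), ('T1', 'b0'), ('T0', 'a1'), ('T1', 'b1')]
```

Because consecutive cycles belong to different threads, a dependency between a0 and a1 has an extra cycle to resolve before a1 issues.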
32The Thread Unit of Hobbes
- The thread unit contains all of the elements required to support a single thread.
- It consists of a fetch buffer, issue buffer, decode logic, branch adder and the thread state storage.
33The Thread Unit
- Instruction fetch is performed by reading an entire cache line of four words and storing it in the fetch buffer.
- Each thread decodes and issues its instructions in program order. After an instruction has been decoded, it is stalled until all of its operands are available.
- Once the operands are ready, the instruction is placed into the issue buffer and the issue unit is notified.
- The register file is very similar to that found on the R2000. The register file has two write ports, and both of these may be used by the same thread.
- Branches which do not affect the register file are executed in the thread unit and are not issued to the execution unit.
34The Execution Units of Hobbes
- The Hobbes architecture has an almost identical set of execution units to an out-of-order superscalar processor.
- The characteristics of the execution units approximately correspond to those of the R2000/R2010.
- Execution Units
- - Integer: 2 ALUs, Shifter, Multiply/Divide, Load/Store, Data cache interface
- - FP: FP Convert, FP Add, FP Multiply, FP Divide
35Superscalar Architecture
- Superscalar processors improve performance by reducing the average number of cycles required to execute each instruction
- This is accomplished by issuing and executing more than one independent instruction per cycle, rather than limiting execution to just one instruction per cycle as traditional pipelined architectures do.
- For superscalar architectures to experience speed-up over traditional pipelined architectures, they require the average level of available instruction-level parallelism to be greater than one.
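The last point can be made concrete with an idealized bound. This is a sketch under simplifying assumptions (no stalls, cache misses or mispredictions): sustained IPC is taken to be the smaller of the issue width and the average available instruction-level parallelism, and the function names are hypothetical.

```python
def ideal_ipc(issue_width: float, available_ilp: float) -> float:
    """Upper bound on instructions per cycle in this idealized model."""
    return min(issue_width, available_ilp)


def speedup_over_scalar(issue_width: float, available_ilp: float) -> float:
    """Ideal speed-up relative to a pipeline that issues one instruction per cycle."""
    return ideal_ipc(issue_width, available_ilp) / ideal_ipc(1, available_ilp)


for ilp in (1.0, 1.5, 4.0):
    print(ilp, speedup_over_scalar(4, ilp))
# 1.0 1.0
# 1.5 1.5
# 4.0 4.0
```

With an average ILP of one, a 4-issue machine gains nothing over a scalar pipeline; the extra issue slots only pay off when more than one independent instruction is available per cycle.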
36References
- Hennessy, John L. and Patterson, David A. Computer Organization and Design: The Hardware/Software Interface. San Francisco: Morgan Kaufmann Publishers, 1998.
- Sarimento, Sara. Recent History of Intel Architecture: A Refresher. 17 April 2004. Intel Corporation, www.intel.com. 18 April 2004. http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/44015.htm
- Zhou and Martonosi. Augmenting Modern Superscalar Architectures with Configurable Extended Instructions. 19 April 2004. http://ipdps.eece.unm.edu/2000/raw/18000943.pdf
- Kish and Preiss. Hobbes: A Multi-Threaded Superscalar Architecture. 19 April 2004. http://www.brpreiss.com/page75.html
- R10000 Processor User's Manual. 9 Dec 1996. SGI Corporation. 22 April 2004. http://techpubs.sgi.com/library/dynaweb_docs/hdwr/SGI_Developer/books/R10K_UM/sgi_html/index.html
- MIPS Architecture. 17 April 2004. Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Main_Page. 23 April 2004. http://en.wikipedia.org/wiki/MIPS_architecture
- Mapleson, Ian. Indigo 2 and Power Indigo 2 Technical Report. Silicon Graphics. 23 April 2004. http://sgi.cartsys.net/i2sec7.html
- Power PC Architecture. 23 April 2004. http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html