Title: William Stallings Computer Organization and Architecture
1William Stallings Computer Organization and Architecture
- Chapter 14
- Instruction Level Parallelism and Superscalar Processors
2Topics
- Overview
- Design Issues
- Pentium 4 and PowerPC
3Overview - What is Superscalar?
- Originally refers to a machine designed to improve the performance of execution of scalar instructions (Agerwala, 1987)
- Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
- Equally applicable to RISC and CISC
- In practice usually RISC
- Has become the standard method for implementing
high-performance microprocessors
4Why Superscalar?
- Most operations are on scalar quantities (see RISC notes)
- Improve these operations to get an overall improvement
5Idea...
- What if we had more than one pipeline?
- Ability to execute instructions independently in different pipelines
- Instructions can be executed in an order different from the program order
- e.g., a = a + 2; b = c + d
- b = c + d; a = a + 2
- Do they give the same result?
- Degree of parallelism goes up as more instructions are executed in parallel
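The reordering question on this slide can be checked directly. The following is a minimal sketch, assuming the two statements are `a = a + 2` and `b = c + d` (the `+` operators are an assumption, since the slide text does not preserve them). The function names are hypothetical. The second pair of functions is an added contrast case in which the second statement reads `a`, so the order does matter.

```python
def program_order(a, c, d):
    # Original program order: update a, then compute b.
    a = a + 2
    b = c + d
    return a, b


def swapped_order(a, c, d):
    # Same two statements, executed in the opposite order.
    b = c + d
    a = a + 2
    return a, b


def dependent_program_order(a, d):
    # Contrast case: b reads the value just written to a.
    a = a + 2
    b = a + d
    return a, b


def dependent_swapped_order(a, d):
    # Swapping now makes b read the old value of a.
    b = a + d
    a = a + 2
    return a, b


if __name__ == "__main__":
    # Independent statements: order does not change the result.
    print(program_order(1, 10, 20) == swapped_order(1, 10, 20))
    # Dependent statements: order changes the result.
    print(dependent_program_order(1, 20) == dependent_swapped_order(1, 20))
```

The first comparison holds because neither statement reads or writes a variable the other writes; the second fails because of a write-read relationship on `a`.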
6General Superscalar Organization
- 2 integer, 2 FP, and 1 memory operations can be
executed at the same time
7Superpipelined
- Many pipeline stages need less than half a clock cycle
- Double internal clock speed gets two tasks per external clock cycle
- Degree of superpipelining
- Number of substages
- Speed goes up as degree of superpipelining goes up
- Superscalar allows parallel fetch and execute
8Superscalar vs Superpipeline
9Example
- 6 instructions, 4 stages
- No pipelining: ___ time units
- Basic pipelining: ___ time units
- Degree of superpipelining = 2: ___ time units
- Degree of superscalar = 2: ___ time units
- Try generalizing it with n instructions and k stages
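One way to approach the generalization asked for above is to write the timing as functions of n and k. This is a sketch under idealized assumptions that are not stated on the slide: every stage takes exactly one time unit, there are no stalls or dependencies, a superpipelined machine of degree m starts a new instruction every 1/m time unit, and a superscalar machine of degree m starts m instructions per time unit. The function names are hypothetical.

```python
import math


def no_pipelining(n, k):
    # Each instruction runs all k stages before the next one starts.
    return n * k


def basic_pipelining(n, k):
    # First instruction takes k units; each later one finishes 1 unit after the previous.
    return k + (n - 1)


def superpipelined(n, k, degree):
    # A new instruction starts every 1/degree time unit.
    return k + (n - 1) / degree


def superscalar(n, k, degree):
    # Instructions enter the pipeline in groups of `degree` per time unit.
    return k + math.ceil(n / degree) - 1


if __name__ == "__main__":
    n, k = 6, 4
    print(no_pipelining(n, k))
    print(basic_pipelining(n, k))
    print(superpipelined(n, k, 2))
    print(superscalar(n, k, 2))
```

Running the script with n = 6 and k = 4 is a way to check answers to the blanks on this slide after working them out by hand.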
10Limitations of Superscalar Approach
- Instruction level parallelism
- Degree to which instructions can be executed in parallel
- Can be maximized by
- Compiler-based optimisation
- Hardware techniques
- Limited by
- True data dependency
- Procedural dependency
- Resource conflicts
- Output dependency
- Antidependency
11True Data Dependency
- Example
- ADD r1, r2 // r1 ← r1 + r2
- MOVE r3, r1 // r3 ← r1
- Can fetch and decode second instruction in parallel with first
- Can NOT execute second instruction until first is finished
- Flow dependency or write-read dependency
12Procedural Dependency
- Cannot execute instructions after a branch in parallel with instructions before a branch
- Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
- This prevents simultaneous fetches
13Resource Conflict
- Two or more instructions requiring access to the same resource at the same time
- e.g. two arithmetic instructions
- Can duplicate resources
- e.g. have two arithmetic units
14Dependencies
15Dependencies - Examples
- Example 1
- LOAD R1 ← R2
- ADD R3 ← R3, 1 // 1: immediate mode
- ADD R4 ← R4, R2
- Degree of parallelism: __
- Example 2
- ADD R3 ← R3, 1
- ADD R4 ← R3, R2
- STORE R4 ← R0 // R4: register indirect
- Degree of parallelism: __
16Design Issues
- Instruction level parallelism
- Instructions in a sequence are independent
- Execution can be overlapped
- Governed by data and procedural dependency
- Machine Parallelism
- Ability to take advantage of instruction level parallelism
- Governed by number of parallel pipelines, i.e., number of instructions that can be fetched and executed at a time
17Instruction Issue Policy (1)
- Instruction issue
- Process of initiating instruction execution in the processor's functional units
- Instruction-issue policy
- Protocol used to issue instructions
- Processor tries to look ahead to locate
instructions that can be brought into pipeline
and executed
18Instruction Issue Policy (2)
- Important orderings in which
- instructions are fetched
- instructions are executed
- instructions change registers and memory
- Order(s) changed to optimize performance
- Constraint: result must be CORRECT!
- Three categories of instruction-issue policies
- In-order issue, in-order completion
- In-order issue, out-of-order completion
- Out-of-order issue, out-of-order completion
19In-Order Issue In-Order Completion
- Issue instructions in the order they occur
- Not very efficient
- May fetch >1 instruction
- Instructions must stall if necessary
20In-Order Issue In-Order Completion (Diagram)
[Diagram: pipeline occupancy over time. Assumptions: 3 functional units. Annotations: "Takes 2 cycles", "F.D.", "D.D.", "F.D."]
21In-Order Issue Out-of-Order Completion (1)
- Any number of instructions may be in the execution stage at a time
- Up to maximum degree of machine parallelism
- Instruction issuing is stalled by resource conflicts, data or procedural dependencies
- Output dependency must be solved
22In-Order Issue Out-of-Order Completion (2)
- Output dependency - Example
- R3 ← R3 + R5 (I1)
- R4 ← R3 + 1 (I2)
- R3 ← R5 + 1 (I3)
- I2 depends on result of I1 - data dependency
- If I3 completes before I1, the final value of R3 will be wrong (I1's result overwrites I3's) → output dependency
- Write-write dependency
23In-Order Issue Out-of-Order Completion (Diagram)
[Diagram: pipeline occupancy over time]
24Out-of-Order Issue Out-of-Order Completion
- Decouple decode from execution by a buffer (instruction window) that stores decoded instructions
- Can continue to fetch and decode until this buffer is full
- When a functional unit becomes available an instruction can be executed
- Since instructions have been decoded, processor can look ahead
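The effect of looking ahead in the instruction window can be illustrated with a toy issue model. This is a sketch, not a description of any real processor: it assumes the window holds the whole program, that only true (write-read) dependencies delay an instruction, that a result is usable `latency` cycles after issue, and that at most `width` instructions issue per cycle. The names `Instr`, `schedule`, `width`, and `EXAMPLE` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int = 1


def producers(program):
    # For each instruction, the indices of earlier instructions whose results it reads.
    last_writer = {}
    deps = []
    for i, ins in enumerate(program):
        deps.append({last_writer[s] for s in ins.srcs if s in last_writer})
        last_writer[ins.dest] = i
    return deps


def schedule(program, width=2, out_of_order=False):
    # Return the issue cycle of each instruction, in program order.
    deps = producers(program)
    issue = {}
    cycle = 0
    while len(issue) < len(program):
        cycle += 1
        issued_now = 0
        for i, ins in enumerate(program):
            if i in issue:
                continue
            ready = all(
                j in issue and issue[j] + program[j].latency <= cycle
                for j in deps[i]
            )
            if ready and issued_now < width:
                issue[i] = cycle
                issued_now += 1
            elif not out_of_order:
                # In-order issue: a blocked instruction blocks everything behind it.
                break
    return [issue[i] for i in range(len(program))]


# Illustrative program: I2 needs the result of the two-cycle instruction I1.
EXAMPLE = [
    Instr("I1", "R1", ("R2",), latency=2),
    Instr("I2", "R3", ("R1",)),
    Instr("I3", "R4", ("R5",)),
    Instr("I4", "R6", ("R7",)),
]

if __name__ == "__main__":
    print(schedule(EXAMPLE, width=2, out_of_order=False))  # [1, 3, 3, 4]
    print(schedule(EXAMPLE, width=2, out_of_order=True))   # [1, 3, 1, 2]
```

With in-order issue, I3 and I4 wait behind the stalled I2 even though they are independent; with out-of-order issue they fill the otherwise idle issue slots.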
25Out-of-Order Issue Out-of-Order Completion (Diagram)
[Diagram: pipeline occupancy]
26Antidependency
- Read-write dependency
- R3 ← R3 + R5 (I1)
- R4 ← R3 + 1 (I2)
- R3 ← R5 + 1 (I3)
- R7 ← R3 + R4 (I4)
- I3 cannot complete before I2 starts as I2 needs a
value in R3 and I3 changes R3
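The three register dependency types (write-read, write-write, read-write) can be found mechanically from the destination and source registers of each instruction. The sketch below applies that idea to I1 to I4, assuming the lost operators are `+` so that the sources are as listed. The function name `hazards` and the tuple layout are hypothetical. A pair is skipped when an instruction between the two overwrites the register in question, since the nearer pair already carries that constraint.

```python
def hazards(program):
    # program: list of (name, dest, srcs) in program order.
    # Returns (earlier, later, kind, register) tuples.
    found = []
    for i, (n1, d1, s1) in enumerate(program):
        for j in range(i + 1, len(program)):
            n2, d2, s2 = program[j]
            # Registers overwritten strictly between the two instructions.
            between = {d for _, d, _ in program[i + 1:j]}
            if d1 in s2 and d1 not in between:
                found.append((n1, n2, "RAW", d1))  # true data (write-read) dependency
            if d2 in s1 and d2 not in between:
                found.append((n1, n2, "WAR", d2))  # antidependency (read-write)
            if d1 == d2 and d1 not in between:
                found.append((n1, n2, "WAW", d1))  # output (write-write) dependency
    return found


PROGRAM = [
    ("I1", "R3", ("R3", "R5")),
    ("I2", "R4", ("R3",)),
    ("I3", "R3", ("R5",)),
    ("I4", "R7", ("R3", "R4")),
]

if __name__ == "__main__":
    for earlier, later, kind, reg in hazards(PROGRAM):
        print(f"{earlier} -> {later}: {kind} on {reg}")
```

The output includes the I2 to I3 antidependency on R3 described above, the I1 to I3 output dependency on R3, and the true dependencies I1 to I2, I2 to I4, and I3 to I4. It also reports a read-write pair between I1 and I3, because I1 itself reads R3 before I3 overwrites it.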
27Register Renaming (1)
- Output dependencies and antidependencies occur because register contents may not reflect the correct ordering from the program
- May result in a pipeline stall
- Solution: allocate registers dynamically
- i.e. registers are not specifically named
28Register Renaming (2)
- By processor hardware
- Associated with values needed by instructions at various points in time
- When a new register value is created, a new register is allocated for it
- Subsequent instructions that access that value as a source must go through renaming so that they refer to the newly allocated register
29Register Renaming example
- I1: R3 ← R3 + R5 becomes I1: R3b ← R3a + R5a
- I2: R4 ← R3 + 1 becomes I2: R4b ← R3b + 1
- I3: R3 ← R5 + 1 becomes I3: R3c ← R5a + 1
- I4: R7 ← R3 + R4 becomes I4: R7b ← R3c + R4b
- Q: Where are the dependencies?
- Without subscript: refers to logical register in instruction
- With subscript: hardware register allocated
- Write-write and read-write dependencies are gone!
- Q: Can we get rid of write-read dependencies?
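The renaming rule can be sketched in a few lines: sources read the current version of a register, and every write allocates a fresh version. This sketch assumes every logical register starts as version "a", that the supply of hardware registers is unlimited (up to 26 versions here), and that the lost operators are `+` so the operands are as listed. The function name `rename` and the tuple layout are hypothetical.

```python
import string


def rename(program):
    # program: list of (name, dest, srcs); operands not starting with "R" are immediates.
    version = {}  # logical register -> index of its current version letter

    def current(reg):
        # Registers seen for the first time start at version "a".
        return f"{reg}{string.ascii_lowercase[version.setdefault(reg, 0)]}"

    renamed = []
    for name, dest, srcs in program:
        # Sources are renamed first, so they read the value before this write.
        new_srcs = tuple(current(s) if s.startswith("R") else s for s in srcs)
        # Each write allocates a new version of the destination register.
        version[dest] = version.get(dest, 0) + 1
        renamed.append((name, current(dest), new_srcs))
    return renamed


PROGRAM = [
    ("I1", "R3", ("R3", "R5")),
    ("I2", "R4", ("R3", "1")),
    ("I3", "R3", ("R5", "1")),
    ("I4", "R7", ("R3", "R4")),
]

if __name__ == "__main__":
    for name, dest, srcs in rename(PROGRAM):
        print(f"{name}: {dest} <- {' + '.join(srcs)}")
```

After renaming, I1 and I3 write different hardware registers (R3b and R3c), so neither the write-write nor the read-write conflict on R3 remains. I2 still reads R3b and I4 still reads R3c and R4b, so the write-read dependencies are untouched.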
30Machine Parallelism
- Three hardware techniques to enhance performance
- Duplication of Resources
- Out of order issue
- Renaming
- Not worth duplicating functional units without register renaming
- Need instruction window large enough (more than 8)
31Branch Prediction
- 80486 fetches both next sequential instruction after branch and branch target instruction
- Gives two-cycle delay if branch taken
- Pre-RISC technique
32RISC - Delayed Branch
- Calculate result of branch before unusable instructions pre-fetched
- Always execute single instruction immediately following branch
- Keeps pipeline full while fetching new instruction stream
- Not as good for superscalar, as
- Multiple instructions need to execute in delay slot
- Instruction dependence problems
- Revert to branch prediction
33Superscalar Execution
34Superscalar Implementation
- Simultaneously fetch multiple instructions
- Logic to determine true dependencies involving register values
- Mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order
35Pentium 4
- 80486: CISC
- Pentium: some superscalar components
- Two separate integer execution units
- Pentium Pro: full-blown superscalar
- Subsequent models refine and enhance superscalar design
36Pentium 4 Block Diagram
37Pentium 4 Operation
- Fetch instructions from memory in order of static program
- Translate each instruction into one or more fixed-length RISC instructions (micro-operations)
- Execute micro-ops on superscalar pipeline
- Micro-ops may be executed out of order
- Commit results of micro-ops to register set in original program flow order
- Outer CISC shell with inner RISC core
- Inner RISC core pipeline at least 20 stages
- Some micro-ops require multiple execution stages
- Longer pipeline
- cf. five-stage pipeline on x86 up to Pentium
38Pentium 4 Pipeline
39Pentium 4 Pipeline Operation (1)
40Pentium 4 Pipeline Operation (2)
41Pentium 4 Pipeline Operation (3)
42Pentium 4 Pipeline Operation (4)
43Pentium 4 Pipeline Operation (5)
44Pentium 4 Pipeline Operation (6)
45PowerPC
- Direct descendant of IBM 801, RT PC and RS/6000
- All are RISC
- RS/6000 first superscalar
- PowerPC 601 superscalar design similar to RS/6000
- Later versions extend superscalar concept
46PowerPC 601 General View
47PowerPC 601 Pipeline Structure
48PowerPC 601 Pipeline
49Required Reading
- Stallings chapter 14
- Manufacturers' web sites
- IMPACT web site
- research on predicated execution