Title: Chapter 14 Instruction Level Parallelism and Superscalar Processors
What is Superscalar?
A superscalar machine executes multiple independent instructions in parallel.
- Common instructions (arithmetic, load/store, conditional branch) can be executed independently.
- Equally applicable to RISC and CISC, but more straightforward in RISC machines.
- The compiler can help by ordering instructions to expose parallelism, but the hardware decides at run time which instructions actually execute together.
Example Superscalar Organization
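Superpipelined Machines
Superpipelined machines overlap pipeline stages more finely: they rely on a stage being able to begin a new operation before the previous operation in that stage has completed, typically by splitting each stage into substages clocked at a multiple of the base clock.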
Superscalar vs. Superpipeline
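As a rough illustration of the difference, the sketch below compares idealized cycle counts for N instructions on a k-stage pipeline. It assumes no dependencies or stalls, a superscalar machine of degree m that starts m instructions together every base cycle, and a superpipelined machine of degree n whose minor cycles are exactly 1/n of a base cycle; the function names are hypothetical.

```python
import math


def base_pipeline_cycles(n_instr: int, k_stages: int) -> float:
    """Ideal scalar pipeline: the first instruction needs k cycles,
    then one more instruction completes every cycle."""
    return k_stages + (n_instr - 1)


def superscalar_cycles(n_instr: int, k_stages: int, degree: int) -> float:
    """Ideal superscalar: `degree` instructions enter the pipeline together
    each base cycle, so a group of `degree` completes every cycle."""
    return k_stages + math.ceil(n_instr / degree) - 1


def superpipelined_cycles(n_instr: int, k_stages: int, degree: int) -> float:
    """Ideal superpipeline: each base cycle is split into `degree` minor
    cycles and a new instruction starts every minor cycle.
    The result is still measured in base cycles."""
    return k_stages + (n_instr - 1) / degree


if __name__ == "__main__":
    n, k = 10, 4
    print(base_pipeline_cycles(n, k))       # 13
    print(superscalar_cycles(n, k, 2))      # 8
    print(superpipelined_cycles(n, k, 2))   # 8.5
```

In this model both degree-2 machines approach twice the base throughput for large N; the superpipelined one stays slightly behind because successive instructions start half a base cycle apart rather than simultaneously.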
Limitations of Superscalar
- Dependent upon
  - Instruction level parallelism
  - Compiler-based optimization
  - Hardware support
- Limited by
  - True data dependency
  - Procedural dependency
  - Resource conflicts
  - Output dependency (two instructions write the same register, and the later write must be the one that remains)
  - Antidependency (one instruction can overwrite a value that an earlier instruction has not yet read)
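The three register-based limitations above can be told apart mechanically by comparing what each instruction reads and writes. The sketch below does this for a pair of instructions whose operands are annotated by hand; `Instr` and `register_dependencies` are hypothetical names, and procedural dependencies and resource conflicts are not modelled because they are not register relations.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    text: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def register_dependencies(first: Instr, second: Instr) -> Set[str]:
    """Classify the register dependencies of `second` on the earlier `first`."""
    deps = set()
    if first.dest is not None and first.dest in second.srcs:
        deps.add("true")    # read after write: second needs first's result
    if second.dest is not None and second.dest in first.srcs:
        deps.add("anti")    # write after read: second must not overwrite too early
    if first.dest is not None and first.dest == second.dest:
        deps.add("output")  # write after write: the later write must survive
    return deps


if __name__ == "__main__":
    add = Instr("ADD r1, r2", "r1", ("r1", "r2"))
    move = Instr("MOVE r3, r1", "r3", ("r1",))
    print(register_dependencies(add, move))   # {'true'}

    i1 = Instr("ADD R4, R3, 1", "R4", ("R3",))
    i2 = Instr("ADD R3, R5, 1", "R3", ("R5",))
    print(register_dependencies(i1, i2))      # {'anti'}
```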
True Data Dependency
- ADD r1, r2 (r1 := r1 + r2)
- MOVE r3, r1 (r3 := r1)
- Can fetch and decode the second instruction in parallel with the first
- Can NOT execute the second instruction until the first is finished
- Compare with the following:
- LOAD r1, X (r1 := contents of memory location X)
- MOVE r3, r1 (r3 := r1)
- What additional problem do we have here?
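One aspect of the question above is latency: a memory access usually takes longer than a register-to-register operation (longer still on a cache miss), so the dependent MOVE waits longer. The sketch below computes the earliest start cycle of each instruction from true data dependencies alone, assuming unlimited functional units and hypothetical latencies of 1 cycle for ADD and MOVE and 3 cycles for LOAD; `earliest_starts` and `LATENCY` are illustrative names.

```python
from typing import Dict, List, Optional, Tuple

# (text, destination register or None, source registers, operation kind)
Op = Tuple[str, Optional[str], Tuple[str, ...], str]


def earliest_starts(program: List[Op], latency: Dict[str, int]) -> List[int]:
    """Cycle in which each instruction can start executing, considering only
    true data dependencies (unlimited functional units, no fetch/decode limits)."""
    ready_at: Dict[str, int] = {}   # register -> cycle its new value is available
    starts = []
    for _text, dest, srcs, kind in program:
        start = max((ready_at.get(r, 0) for r in srcs), default=0)
        starts.append(start)
        if dest is not None:
            ready_at[dest] = start + latency[kind]
    return starts


LATENCY = {"add": 1, "move": 1, "load": 3}   # hypothetical cycle counts

pair_a = [("ADD r1, r2", "r1", ("r1", "r2"), "add"),
          ("MOVE r3, r1", "r3", ("r1",), "move")]
pair_b = [("LOAD r1, X", "r1", (), "load"),   # X is a memory address, not a register
          ("MOVE r3, r1", "r3", ("r1",), "move")]

if __name__ == "__main__":
    print(earliest_starts(pair_a, LATENCY))   # [0, 1]
    print(earliest_starts(pair_b, LATENCY))   # [0, 3]
```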
Procedural Dependency
- Cannot execute instructions after a branch in parallel with instructions before the branch. Why?
- Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
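The second point can be made concrete by finding instruction boundaries in a byte stream. The sketch assumes a hypothetical encoding in which the first byte of each variable-length instruction holds its total length; with fixed-length instructions every start offset is known up front, while with variable lengths each start depends on decoding the previous instruction. Both function names are illustrative.

```python
from typing import List


def fixed_length_boundaries(code: bytes, width: int = 4) -> List[int]:
    """With fixed-length instructions every start offset is known in advance,
    so several instructions can be fetched and decoded at once."""
    return list(range(0, len(code), width))


def variable_length_boundaries(code: bytes) -> List[int]:
    """Hypothetical encoding: the first byte of each instruction holds its total
    length in bytes. The next start is only known after decoding this one."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]
        if length == 0:
            raise ValueError(f"zero-length instruction at offset {pc}")
        pc += length
    return starts


if __name__ == "__main__":
    print(fixed_length_boundaries(bytes(12)))                              # [0, 4, 8]
    print(variable_length_boundaries(bytes([2, 0xAA, 3, 0xBB, 0xCC, 1])))  # [0, 2, 5]
```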
Resource Conflict
- Two or more instructions requiring access to the same resource at the same time
  - e.g. two arithmetic instructions needing the same arithmetic unit
- Solution: possibly duplicate resources
  - e.g. have two arithmetic units
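The effect of duplicating a resource can be sketched by counting issue cycles. This assumes independent instructions (no data dependencies), each holding its functional unit for exactly one cycle, issued in program order with no limit other than the number of units; `cycles_needed` and the unit names are hypothetical.

```python
from typing import Dict, List


def cycles_needed(kinds: List[str], units: Dict[str, int]) -> int:
    """Cycles to issue independent, single-cycle instructions in program order.

    Every unit is free again at the start of each cycle. Within a cycle,
    instructions issue in order until the next one needs a unit type whose
    units are all in use; that instruction (and everything after it) waits.
    """
    missing = {k for k in kinds if units.get(k, 0) <= 0}
    if missing:
        raise ValueError(f"no functional unit for: {sorted(missing)}")
    cycles = 0
    i = 0
    while i < len(kinds):
        free = dict(units)
        while i < len(kinds) and free[kinds[i]] > 0:
            free[kinds[i]] -= 1
            i += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    four_adds = ["alu"] * 4
    print(cycles_needed(four_adds, {"alu": 1}))   # 4: every ADD waits for the one ALU
    print(cycles_needed(four_adds, {"alu": 2}))   # 2: duplicated ALUs remove the conflict
```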
Antidependency
- ADD R4, R3, 1 (R4 := R3 + 1)
- ADD R3, R5, 1 (R3 := R5 + 1)
- Cannot complete the second instruction before the first has read R3
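A small register-file simulation shows what goes wrong if the second instruction completes first. The initial register values are arbitrary, and `i1`, `i2` and `run` are illustrative names standing for the two ADDs on this slide.

```python
from typing import Callable, Dict, List

Regs = Dict[str, int]


def i1(r: Regs) -> None:
    """ADD R4, R3, 1: reads R3."""
    r["R4"] = r["R3"] + 1


def i2(r: Regs) -> None:
    """ADD R3, R5, 1: overwrites R3."""
    r["R3"] = r["R5"] + 1


def run(steps: List[Callable[[Regs], None]], initial: Regs) -> Regs:
    regs = dict(initial)   # work on a copy so each run starts from the same state
    for step in steps:
        step(regs)
    return regs


initial = {"R3": 10, "R4": 0, "R5": 20}   # arbitrary example values

if __name__ == "__main__":
    print(run([i1, i2], initial))   # program order: R4 = 11 (uses the old R3)
    print(run([i2, i1], initial))   # second instruction completes first: R4 = 22 (wrong)
```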
Effect of Dependencies
Instruction-level Parallelism
- Consider
  - LOAD R1, R2
  - ADD R3, 1
  - ADD R4, R2
- These can be handled in parallel
- Consider
  - ADD R3, 1
  - ADD R4, R3
  - STO (R4), R0
- These cannot: each instruction needs the result of the one before it
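The two groups can be checked with a pairwise test: a group can run in parallel only if no instruction has a true, anti- or output dependency on an earlier one. The sketch assumes the two-operand ADD means "destination := destination + source" and ignores memory dependencies; `can_run_in_parallel` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Optional, Tuple

# (destination register or None, registers read)
Op = Tuple[Optional[str], Tuple[str, ...]]


def can_run_in_parallel(group: List[Op]) -> bool:
    """True if no instruction in `group` has a true, anti- or output dependency
    on an earlier one. Two instructions merely reading the same register is fine."""
    for (d1, s1), (d2, s2) in combinations(group, 2):   # pairs keep program order
        if d1 is not None and (d1 in s2 or d1 == d2):     # true or output dependency
            return False
        if d2 is not None and d2 in s1:                   # antidependency
            return False
    return True


group_a = [("R1", ("R2",)),          # LOAD R1, R2
           ("R3", ("R3",)),          # ADD R3, 1      (R3 := R3 + 1)
           ("R4", ("R4", "R2"))]     # ADD R4, R2     (R4 := R4 + R2)
group_b = [("R3", ("R3",)),          # ADD R3, 1
           ("R4", ("R4", "R3")),     # ADD R4, R3     (R4 := R4 + R3)
           (None, ("R4", "R0"))]     # STO (R4), R0   (memory[R4] := R0)

if __name__ == "__main__":
    print(can_run_in_parallel(group_a))   # True
    print(can_run_in_parallel(group_b))   # False
```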
Instruction Issue Policies
- Order in which instructions are fetched
- Order in which instructions are executed
- Order in which instructions change registers and
memory
In-Order Issue with In-Order Completion
- Issue instructions in the order they occur
- Not very efficient
- May fetch more than one instruction per cycle
- Instructions must stall if necessary
In-Order Issue with In-Order Completion (Diagram)
- Assume
  - I1 requires 2 cycles to execute
  - I3 and I4 conflict for the same functional unit
  - I5 depends upon the value produced by I4
  - I5 and I6 conflict for a functional unit
In-Order Issue with Out-of-Order Completion
How does this affect interrupts?
Out-of-Order Issue with Out-of-Order Completion
- Decouple the decode pipeline from the execution pipeline
- Can continue to fetch and decode until this pipeline is full
- When a functional unit becomes available, an instruction can be executed
- Since instructions have been decoded, the processor can look ahead
Out-of-Order Issue with Out-of-Order Completion (Diagram)
Note: I5 depends upon I4, but I6 does not, so I6 can be issued ahead of I5.
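The I5/I6 point can be sketched with a toy issue model that tracks only true data dependencies; it does not model the functional-unit conflicts in the diagram's assumptions, so it does not reproduce the diagram itself. It assumes an issue width of 2 and a hypothetical 2-cycle latency for I4; `issue_cycles` and the `in_order` flag are illustrative names. With `in_order=False`, I6 goes around the stalled I5.

```python
from typing import Dict, Set, Tuple

# name -> (execution latency in cycles, names of instructions it depends on)
Program = Dict[str, Tuple[int, Set[str]]]


def issue_cycles(program: Program, width: int, in_order: bool) -> Dict[str, int]:
    """Cycle in which each instruction issues, oldest first, up to `width` per cycle.

    An instruction may issue once every instruction it depends on has finished.
    With in_order=True, issue stops at the first instruction that is not ready;
    with in_order=False, later ready instructions may go around it.
    Only true data dependencies are modelled (no resource conflicts).
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    names = list(program)   # dicts keep insertion order, i.e. program order
    for i, name in enumerate(names):
        if program[name][0] < 1 or not program[name][1] <= set(names[:i]):
            raise ValueError(f"{name}: need latency >= 1 and only earlier producers")
    issued: Dict[str, int] = {}
    finish: Dict[str, int] = {}
    cycle = 0
    while len(issued) < len(names):
        slots = width
        for name in names:
            if slots == 0:
                break
            if name in issued:
                continue
            latency, deps = program[name]
            if all(d in finish and finish[d] <= cycle for d in deps):
                issued[name] = cycle
                finish[name] = cycle + latency
                slots -= 1
            elif in_order:
                break   # in-order issue: nothing may pass a stalled instruction
        cycle += 1
    return issued


example: Program = {
    "I1": (2, set()),
    "I2": (1, set()),
    "I3": (1, set()),
    "I4": (2, set()),      # hypothetical 2-cycle latency
    "I5": (1, {"I4"}),     # I5 depends upon I4
    "I6": (1, set()),      # I6 does not
}

if __name__ == "__main__":
    print(issue_cycles(example, width=2, in_order=True))    # I6 waits for I5 (cycle 3)
    print(issue_cycles(example, width=2, in_order=False))   # I6 issues in cycle 2
```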
Register Renaming
- Output dependencies and antidependencies occur because register contents may not reflect the correct ordering from the program
- Could result in a pipeline stall
- One solution: allocate hardware registers dynamically, so each new value gets its own register
Register Renaming Example
- R3b := R3a + R5a (I1)
- R4b := R3b + 1 (I2)
- R3c := R5a + 1 (I3)
- R7b := R3c + R4b (I4)
- Without a subscript: the logical register named in the instruction
- With a subscript: the hardware register allocated to it
- Note that R3a, R3b and R3c are different hardware registers
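The renaming in this example can be reproduced with a version counter per logical register. The sketch assumes the logical program is the example with the subscripts dropped (I1: R3 := R3 + R5, I2: R4 := R3 + 1, I3: R3 := R5 + 1, I4: R7 := R3 + R4) and uses letters as version tags; `rename` is a hypothetical helper, not a description of any particular processor's rename logic.

```python
import re
from string import ascii_lowercase
from typing import Dict, List, Tuple

REGISTER = re.compile(r"R\d+")


def rename(program: List[Tuple[str, List[str]]]) -> List[str]:
    """program: (destination register, right-hand-side tokens) per instruction.

    Every logical register starts as version 'a'. Sources use the version that
    is current when the instruction is renamed; each write allocates the next
    version, i.e. a fresh hardware register. (Versions past 'z' are not handled.)
    """
    version: Dict[str, str] = {}

    def current(reg: str) -> str:
        return reg + version.setdefault(reg, "a")

    lines = []
    for dest, tokens in program:
        # rename sources first, so "R3 := R3 + R5" reads the old R3
        rhs = " ".join(current(t) if REGISTER.fullmatch(t) else t for t in tokens)
        old = version.setdefault(dest, "a")
        version[dest] = ascii_lowercase[ascii_lowercase.index(old) + 1]
        lines.append(f"{dest}{version[dest]} := {rhs}")
    return lines


program = [
    ("R3", ["R3", "+", "R5"]),   # I1
    ("R4", ["R3", "+", "1"]),    # I2
    ("R3", ["R5", "+", "1"]),    # I3
    ("R7", ["R3", "+", "R4"]),   # I4
]

if __name__ == "__main__":
    for i, line in enumerate(rename(program), start=1):
        print(f"{line} (I{i})")
```

Because I3 writes a new register R3c rather than R3b, it no longer has to wait for I2 to read R3b: the antidependency and the output dependency on I1 disappear, while the true dependency of I2 on I1 remains.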
Machine Parallelism Support
- Duplication of Resources
- Out of order issue
- Renaming
- Windowing
Speedups of Machine Organizations Without Procedural Dependencies
Study Conclusions
- Not worth duplicating functional units without register renaming
- Need an instruction window large enough (more than 8, probably not more than 32)
Branch Prediction in Superscalar Machines
- Delayed branch not used much. Why?
- Branch prediction used
- Branch history may still be useful
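The slide does not name a prediction scheme; one common way to use branch history is a table of 2-bit saturating counters, sketched below as a swapped-in example. The class name, the default initial state (weakly not taken) and the branch address are assumptions for illustration.

```python
from typing import Dict, Iterable


class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address.

    Counter values 0-1 predict not taken, 2-3 predict taken; each outcome moves
    the counter one step towards it, so a single surprise does not flip a
    strongly held prediction.
    """

    def __init__(self, initial: int = 1) -> None:
        self.initial = initial
        self.counters: Dict[int, int] = {}

    def predict(self, pc: int) -> bool:
        return self.counters.get(pc, self.initial) >= 2

    def update(self, pc: int, taken: bool) -> None:
        c = self.counters.get(pc, self.initial)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_mispredictions(pc: int, outcomes: Iterable[bool]) -> int:
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    # A loop branch: taken three times, falls through once, then taken again.
    outcomes = [True, True, True, False, True, True, True]
    print(count_mispredictions(0x400, outcomes))   # 2
```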
Superscalar Execution
Committing or Retiring Instructions
- Sometimes results must be held in temporary storage until it is certain they can be placed in permanent storage (the architectural registers and memory).
- This temporary storage needs to be regularly cleaned up.
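The slide does not name the temporary storage; one common form is a reorder buffer, sketched below under simplifying assumptions (register results only, no branch squashing or exceptions). `ReorderBuffer` and its methods are hypothetical names. Results may arrive in any order, but they are committed to the register file only from the oldest entry, and committing frees the entry, which is the "clean-up" the slide mentions.

```python
from collections import deque
from typing import Deque, Dict, List


class ReorderBuffer:
    """Results wait here (temporary storage) until every earlier instruction has
    finished; they are then copied to the register file (permanent storage)."""

    def __init__(self) -> None:
        self.entries: Deque[Dict] = deque()      # program order, oldest first
        self.registers: Dict[str, int] = {}      # committed state

    def allocate(self, name: str, dest: str) -> None:
        self.entries.append({"name": name, "dest": dest, "value": None, "done": False})

    def complete(self, name: str, value: int) -> None:
        """Record a result; instructions may finish in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(name)

    def retire(self) -> List[str]:
        """Commit finished instructions from the head only, freeing their entries."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["name"])
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer()
    for name, dest in [("I1", "R1"), ("I2", "R2"), ("I3", "R3")]:
        rob.allocate(name, dest)
    rob.complete("I2", 20)      # finishes early, out of order
    print(rob.retire())         # []: I1 has not finished, so nothing commits
    rob.complete("I1", 10)
    print(rob.retire())         # ['I1', 'I2']
    print(rob.registers)        # {'R1': 10, 'R2': 20}
```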
Superscalar Hardware Support
- Simultaneously fetch multiple instructions
- Logic to determine true dependencies involving register values, and mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order