Title: 14 Superscalar Processors
1. Chapter 14: Superscalar Processors
2. What is Superscalar?
A superscalar machine executes multiple independent instructions in parallel.
- Common instructions (arithmetic, load/store, conditional branch) can be executed independently.
- Equally applicable to RISC and CISC, but more straightforward in RISC machines.
- The order of execution is usually determined by the compiler.
3. Example Superscalar Organization
- 2 Integer ALU pipelines,
- 2 FP ALU pipelines,
- 1 memory pipeline (?)
4. Superpipelined Machines
- Superpipelined machines overlap pipe stages: they rely on a stage being able to begin an operation before the previous one is complete.
- Superscalar machines have multiple instruction pipelines and process multiple instructions in parallel.
5. Superscalar vs. Superpipeline
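The comparison can be approximated with idealized cycle counts. This is a minimal sketch, assuming a perfect pipeline with no stalls or dependencies. The function names and the example (4 stages, 6 instructions, degree 2) are illustrative assumptions, not taken from the slides. Times are in base clock cycles. A superpipelined machine of degree n is assumed to start one instruction every 1/n of a base cycle.

```python
import math

def base_cycles(n_instr, stages):
    # Ordinary scalar pipeline: first instruction takes `stages` cycles,
    # each later one finishes one cycle after its predecessor.
    return stages + n_instr - 1

def superscalar_cycles(n_instr, stages, degree):
    # `degree` instructions enter the pipeline together each cycle.
    return stages + math.ceil(n_instr / degree) - 1

def superpipelined_cycles(n_instr, stages, degree):
    # One instruction enters every 1/degree of a base cycle.
    return stages + (n_instr - 1) / degree

print(base_cycles(6, 4))               # 9
print(superscalar_cycles(6, 4, 2))     # 6
print(superpipelined_cycles(6, 4, 2))  # 6.5
```

Under these assumptions both approaches finish well ahead of the base machine, and the superpipelined machine trails the superscalar one slightly at the start of the program.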
6. Limitations of Superscalar
- Dependent upon
  - Instruction-level parallelism
  - Compiler-based optimization
  - Hardware support
- Limited by
  - True data dependency
  - Procedural dependency
  - Resource conflicts
  - Output dependency
  - Antidependency (another form of data dependency)
7. True Data Dependency
- ADD r1, r2 (r1 + r2 → r1)
- MOVE r3, r1 (r1 → r3)
- Can fetch and decode the second instruction in parallel with the first
- Can NOT execute the second instruction until the first is finished
- Compare with the following:
- LOAD r1, X (X → r1)
- MOVE r3, r1 (r1 → r3)
- What additional problem do we have here?
8. Procedural Dependency
- Can't execute instructions after a branch in parallel with instructions before a branch, because?
- Note also: if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
9. Resource Conflict
- Two or more instructions requiring access to the same resource at the same time
  - e.g. two arithmetic instructions
- Solution: can possibly duplicate resources
  - e.g. have two arithmetic units
10. Antidependency
- ADD R4, R3, 1 (R3 + 1 → R4)
- ADD R3, R5, 1 (R5 + 1 → R3)
- Cannot complete the second instruction before the first has read R3
- Why?
11. True Data Dependency vs. Antidependency
- True data dependency
  - result of 1st instr used in 2nd instr
  - (can't complete 1st too soon: 2nd must wait for its result)
- Antidependency
  - out-of-order completion of 2nd instr can write over a value to be used in 1st instr
  - (must complete 1st before 2nd changes the operand value)
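The dependency kinds contrasted above can be detected mechanically by comparing the registers each instruction reads and writes. This is a minimal sketch. The `Instr` record and `classify` function are hypothetical names, and the operand readings (ADD r1, r2 as r1 + r2 → r1; ADD R4, R3, 1 as R3 + 1 → R4) are assumptions about the examples on the earlier slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of source registers."""
    name: str
    dest: str
    srcs: tuple

def classify(first, second):
    """Return the set of dependencies that `second` has on the earlier `first`."""
    deps = set()
    if first.dest in second.srcs:
        deps.add("true (RAW)")    # second reads what first writes
    if second.dest in first.srcs:
        deps.add("anti (WAR)")    # second overwrites what first still has to read
    if first.dest == second.dest:
        deps.add("output (WAW)")  # both write the same register
    return deps

# True data dependency example: ADD r1, r2 then MOVE r3, r1
add = Instr("ADD r1, r2", dest="r1", srcs=("r1", "r2"))
move = Instr("MOVE r3, r1", dest="r3", srcs=("r1",))

# Antidependency example: ADD R4, R3, 1 then ADD R3, R5, 1
i1 = Instr("ADD R4, R3, 1", dest="R4", srcs=("R3",))
i2 = Instr("ADD R3, R5, 1", dest="R3", srcs=("R5",))

print(classify(add, move))  # {'true (RAW)'}
print(classify(i1, i2))     # {'anti (WAR)'}
```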
12. Effect of Dependencies
13. Instruction-level Parallelism
- Consider
  - LOAD R1, R2
  - ADD R3, 1
  - ADD R4, R2
- These can be handled in parallel. Why?
- Consider
  - ADD R3, 1
  - ADD R4, R3
  - STO (R4), R0
- These cannot. Why?
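One way to answer both questions is to measure the longest chain of true data dependencies in each sequence. This is a minimal sketch, assuming unit latency, unlimited functional units, and that only true dependencies matter. The operand readings are assumptions: LOAD R1, R2 is taken as R1 ← memory[R2], ADD R3, 1 as R3 ← R3 + 1, ADD R4, R2 as R4 ← R4 + R2, and STO (R4), R0 as memory[R4] ← R0. `min_cycles` is a hypothetical helper name.

```python
def min_cycles(program):
    """program: list of (dest_or_None, source_registers).
    Returns the length of the longest true-dependency chain in cycles."""
    ready = {}   # register -> cycle after which its newest value is available
    depth = 0
    for dest, srcs in program:
        start = max((ready.get(r, 0) for r in srcs), default=0)
        finish = start + 1          # unit latency (assumption)
        if dest is not None:
            ready[dest] = finish
        depth = max(depth, finish)
    return depth

PARALLEL = [
    ("R1", ("R2",)),        # LOAD R1, R2
    ("R3", ("R3",)),        # ADD R3, 1
    ("R4", ("R4", "R2")),   # ADD R4, R2
]
SERIAL = [
    ("R3", ("R3",)),        # ADD R3, 1
    ("R4", ("R4", "R3")),   # ADD R4, R3
    (None, ("R4", "R0")),   # STO (R4), R0 writes memory, not a register
]

print(min_cycles(PARALLEL))  # 1
print(min_cycles(SERIAL))    # 3
```

In the first sequence no instruction reads a register written by an earlier one, so all three can execute in the same cycle. In the second, each instruction needs the result of the one before it.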
14. Instruction Issue Policies
- Order in which instructions are fetched
- Order in which instructions are executed
- Order in which instructions update registers and memory values
- Note: there is also the issue of instruction completion policy
15. In-Order Issue -- In-Order Completion
- Issue instructions in the order they occur
- Not very efficient
- Instructions must stall if necessary
16. In-Order Issue -- In-Order Completion (Example)
- Assume
  - I1 requires 2 cycles to execute
  - I3 and I4 conflict for the same functional unit
  - I5 depends upon the value produced by I4
  - I5 and I6 conflict for a functional unit
17. In-Order Issue -- Out-of-Order Completion (Example)
- Again
  - I1 requires 2 cycles to execute
  - I3 and I4 conflict for the same functional unit
  - I5 depends upon the value produced by I4
  - I5 and I6 conflict for a functional unit
- How does this affect interrupts?
18. Out-of-Order Issue -- Out-of-Order Completion
- Decouple decode pipeline from execution pipeline
- Can continue to fetch and decode until this pipeline is full
- When a functional unit becomes available, an instruction can be executed
- Since instructions have been decoded, the processor can look ahead
19. Out-of-Order Issue -- Out-of-Order Completion (Example)
- Again
  - I1 requires 2 cycles to execute
  - I3 and I4 conflict for the same functional unit
  - I5 depends upon the value produced by I4
  - I5 and I6 conflict for a functional unit
- Note: I5 depends upon I4, but I6 does not
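The effect of the issue policy on this six-instruction example can be explored with a small simulation. This is a simplified sketch, not a reproduction of the slide figures. It models only the execute stage, issues at most two instructions per cycle, and ignores fetch, decode, and write-back. The unit names (`u1`, `u2`, `u3`), the placement of I1, I2, I5, and I6 on particular units, and the function name `schedule` are assumptions. The slides only state which instructions conflict.

```python
def schedule(program, in_order, width=2):
    """program: list of (name, unit, latency, deps) in program order.
    Returns ({name: issue_cycle}, cycle in which the last instruction finishes)."""
    done_at = {}     # name -> last cycle in which the instruction is executing
    unit_free = {}   # unit -> first cycle in which the unit is free again
    issued = {}
    pending = list(program)
    cycle = 0
    while pending:
        cycle += 1
        slots = width
        for ins in list(pending):
            name, unit, latency, deps = ins
            deps_ok = all(d in done_at and done_at[d] < cycle for d in deps)
            unit_ok = unit_free.get(unit, 1) <= cycle
            if slots and deps_ok and unit_ok:
                issued[name] = cycle
                done_at[name] = cycle + latency - 1
                unit_free[unit] = cycle + latency
                pending.remove(ins)
                slots -= 1
            elif in_order:
                break  # in-order issue: a stalled instruction blocks all later ones
    return issued, max(done_at.values())

PROGRAM = [
    ("I1", "u1", 2, ()),       # I1 requires 2 cycles
    ("I2", "u2", 1, ()),
    ("I3", "u3", 1, ()),       # I3 and I4 share a unit
    ("I4", "u3", 1, ()),
    ("I5", "u2", 1, ("I4",)),  # I5 needs I4's result; I5 and I6 share a unit
    ("I6", "u2", 1, ()),
]

print(schedule(PROGRAM, in_order=True)[1])   # 5
print(schedule(PROGRAM, in_order=False)[1])  # 4
```

In this model the out-of-order run saves a cycle because I6, which does not depend on I4, is issued in cycle 2 while I4 and I5 are still waiting.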
20. Register Renaming
- Output dependencies and antidependencies occur because register contents may not reflect the correct ordering from the program
- Can result in a pipeline stall
- One solution: allocate registers dynamically (renaming registers)
21. Register Renaming Example
- R3b := R3a + R5a (I1)
- R4b := R3b + 1 (I2)
- R3c := R5a + 1 (I3)
- R7b := R3c + R4b (I4)
- Without subscript refers to logical register in instruction
- With subscript is hardware register allocated
  - R3a, R3b, R3c
- Note: R3c avoids the antidependency on I2 and the output dependency on I1
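The renaming in this example can be reproduced with a short routine that allocates a fresh hardware register on every write. This is a minimal sketch, assuming an unlimited supply of hardware registers and that the suffix "a" names the initial copy of each logical register. The function name `rename` and the tuple encoding of instructions are illustrative choices.

```python
from collections import defaultdict

def rename(program):
    """program: list of (dest, sources); sources may include literals such as "1".
    Returns the same program with logical registers replaced by hardware names."""
    version = defaultdict(int)  # logical register -> index of its current hardware copy

    def phys(reg):
        return f"{reg}{chr(ord('a') + version[reg])}"

    renamed = []
    for dest, srcs in program:
        # Sources are read from the current copies, before the destination is renamed.
        new_srcs = [phys(s) if s.startswith("R") else s for s in srcs]
        version[dest] += 1      # every write gets a fresh hardware register
        renamed.append((phys(dest), new_srcs))
    return renamed

PROGRAM = [
    ("R3", ["R3", "R5"]),  # I1: R3 := R3 + R5
    ("R4", ["R3", "1"]),   # I2: R4 := R3 + 1
    ("R3", ["R5", "1"]),   # I3: R3 := R5 + 1
    ("R7", ["R3", "R4"]),  # I4: R7 := R3 + R4
]

for dest, srcs in rename(PROGRAM):
    print(dest, ":=", " + ".join(srcs))
# R3b := R3a + R5a
# R4b := R3b + 1
# R3c := R5a + 1
# R7b := R3c + R4b
```

Because I3 writes R3c rather than R3b, it no longer has to wait for I2 to read R3b or for I1 to finish writing it.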
22. Machine Parallelism Support
- Duplication of Resources
- Out of order issue
- Renaming
- Windowing
23. Speedups of Machine Organizations (Without Procedural Dependencies)
- Not worth duplicating functional units without register renaming
- Need an instruction window large enough (more than 8, probably not more than 32)
24. Branch Prediction in Superscalar Machines
- Delayed branch not used much. Why?
  - Multiple instructions need to execute in the delay slot.
  - This leads to much complexity in recovery.
- Branch prediction may be used
  - Branch history MAY still be useful
- Are there any alternatives?
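As an illustration of how branch history can drive prediction, here is a minimal sketch of a 2-bit saturating-counter predictor. This is one common history-based scheme and is not necessarily the one the slides have in mind. The class name, the initial counter state, the branch address `0x40`, and the loop pattern are all assumptions made for the example.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.
    States 0 and 1 predict not taken; states 2 and 3 predict taken."""

    def __init__(self):
        self.counters = {}

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # unseen branches start weakly not taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

def correct_predictions(outcomes, pc=0x40):
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits

# A loop branch taken 9 times and then not taken, repeated 10 times.
loop = ([True] * 9 + [False]) * 10
print(correct_predictions(loop), "of", len(loop))  # 89 of 100
```

After warming up, the predictor mispredicts only the loop exit, since a single not-taken outcome is not enough to flip its prediction.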
25. Superscalar Execution
26. Committing or Retiring Instructions
- Results need to be put into order (commit or retire)
- Results sometimes must be held in temporary storage until it is certain they can be placed in permanent storage (commit or retire)
- Temporary storage requires regular clean up, which is overhead
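The idea of holding results in temporary storage and committing them in program order can be sketched with a queue. This is a minimal illustration, assuming a structure in the style of a reorder buffer. The class and method names are hypothetical, and real hardware also has to handle exceptions and mispredicted branches, which are omitted here.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # temporary storage, kept in program order
        self.registers = {}     # permanent (committed) register state

    def issue(self, name, dest):
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # Execution may finish in any order; the result stays in the buffer.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        # Retire finished instructions from the head only, so state updates stay in order.
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.registers[e["dest"]] = e["value"]
            committed.append(e["name"])
        return committed

rob = ReorderBuffer()
e1 = rob.issue("I1", "R1")
e2 = rob.issue("I2", "R2")
e3 = rob.issue("I3", "R3")

rob.complete(e3, 30)   # I3 finishes first
rob.complete(e2, 20)
print(rob.commit())    # [] because I1 has not finished yet
rob.complete(e1, 10)
print(rob.commit())    # ['I1', 'I2', 'I3']
```

Removing entries from the head as they retire is the clean-up the slide refers to.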
27. Superscalar Hardware Support
- Facilities to simultaneously fetch multiple instructions
- Logic to determine true dependencies involving register values, and mechanisms to communicate these values
- Mechanisms to initiate multiple instructions in parallel
- Resources for parallel execution of multiple instructions
- Mechanisms for committing process state in correct order
28. Conclusions
- What are the relative benefits of
  - Superscalar?
  - Superpipelining?
29. Superscalar CISC Machines
- Can superscalar design be applied to CISC machines?
30. javax.comm
- Basically, javax.comm is no longer supported on Windows (hasn't been since 2002), so we switched to RxTx, which is nearly identical.
- According to http://en.wikibooks.org/wiki/Serial_Programming Serial_JavaRxTx, "Converting a JavaComm Application to RxTx", all that is required to convert a javacomm application to an RxTx application is simply changing the import statement import javax.comm.* to import gnu.io.*
- Everything else in the program can remain exactly the same because the package gnu.io apparently encompasses the same classes as javax.comm.
- Indeed, the rxtx version of SimpleWrite is identical to the javacomm version of SimpleWrite except that it imports gnu.io.* rather than javax.comm.*
31. Basic Concepts of the IA-64 Architecture
- Instruction-level parallelism
  - Explicit in machine instruction rather than determined at run time by processor
- Long or very long instruction words (LIW/VLIW)
  - Fetch bigger chunks already preprocessed
- Branch predication (not the same as branch prediction)
  - Go ahead and fetch and decode instructions, but keep track of them so the decision to issue them, or not, can be practically made later
- Speculative loading
  - Go ahead and load data so it is ready when needed, and have a practical way to recover if speculation proved wrong
- Software pipelining
  - Allows multiple iterations of a loop to execute in parallel