14 Superscalar Processors presentation

1
Chapter 14
Superscalar Processors

2
What is Superscalar?
A Superscalar machine executes multiple
independent instructions in parallel.

Common instructions (arithmetic, load/store,
conditional branch) can be executed
independently.
Equally applicable to RISC and CISC, but more
straightforward in RISC machines.
The order of execution is usually determined by
the compiler.

3
Example Superscalar Organization

2 Integer ALU pipelines,
2 FP ALU pipelines,
1 memory pipeline (?)

4
Superpipelined Machines
Superpipelined machines overlap pipe stages -
rely on stages being able to begin operations
before the last is complete.
Superscalar machines have multiple instruction
pipelines - process multiple instructions in
parallel.

5
Superscalar v Superpipeline

6
Limitations of Superscalar

Dependent upon
Instruction level parallelism
Compiler based optimization
Hardware support
Limited by
True Data dependency
Procedural dependency
Resource conflicts
Output dependency or
Antidependency (another form of data
dependency)

7
True Data Dependency

ADD r1, r2 (r1 + r2 → r1)
MOVE r3, r1 (r1 → r3)
Can fetch and decode second instruction in
parallel with first
Can NOT execute second instruction until first is
finished
Compare with the following?
LOAD r1, X (X → r1)
MOVE r3, r1 (r1 → r3)
What additional problem do we have here?

8
Procedural Dependency

Cannot execute instructions after a branch in
parallel with instructions before a branch,
because?
Note: Also, if instruction length is not
fixed, instructions have to be decoded to find
out how many fetches are needed

9
Resource Conflict

Two or more instructions requiring access to the
same resource at the same time
e.g. two arithmetic instructions
Solution - Can possibly duplicate resources
e.g. have two arithmetic units

10
Antidependency

ADD R4, R3, 1 (R3 + 1 → R4)
ADD R3, R5, 1 (R5 + 1 → R3)
Cannot complete the second instruction before the
first has read R3
Why?

11
True Data Dependency and Antidependency

True data dependency
result of 1st instr used in 2nd instr
(can't execute 2nd until 1st completes)
Antidependency
out-of-order completion of 2nd instr can
write over value to be used in 1st instr
(must complete 1st before 2nd changes
operand value)
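The distinction above can be checked mechanically by comparing the registers each instruction reads and writes. The following is a minimal sketch, not part of the original slides: the `Instr` record and the `hazards` function are hypothetical names, and the read/write sets are filled in by hand from the examples on slides 7 and 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical record: the registers an instruction writes and reads."""
    text: str
    writes: frozenset
    reads: frozenset


def hazards(first, second):
    """Return the kinds of dependency that `second` has on the earlier `first`."""
    found = set()
    if first.writes & second.reads:
        found.add("true")    # second reads a result produced by first
    if first.reads & second.writes:
        found.add("anti")    # second overwrites an operand first still needs
    if first.writes & second.writes:
        found.add("output")  # both write the same register
    return found


# Slide 7: ADD r1, r2 followed by MOVE r3, r1
add = Instr("ADD r1, r2", frozenset({"r1"}), frozenset({"r1", "r2"}))
move = Instr("MOVE r3, r1", frozenset({"r3"}), frozenset({"r1"}))

# Slide 10: ADD R4, R3, 1 followed by ADD R3, R5, 1
i1 = Instr("ADD R4, R3, 1", frozenset({"R4"}), frozenset({"R3"}))
i2 = Instr("ADD R3, R5, 1", frozenset({"R3"}), frozenset({"R5"}))

print(hazards(add, move))  # {'true'}
print(hazards(i1, i2))     # {'anti'}
```

Real hardware performs the same comparison on register numbers at decode time; the set intersections here only illustrate the rule.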

12
Effect of Dependencies

13
Instruction-level Parallelism

Consider
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel. Why?
Consider
ADD R3, 1
ADD R4, R3
STO (R4), R0
These cannot. Why?
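One way to approach the two "Why?" questions is to test every pair of instructions for a shared register that one of them writes. This is a rough sketch under stated assumptions: the function name `independent` is hypothetical, and the read/write sets assume LOAD R1, R2 reads R2 and writes R1, that each ADD reads and writes its first operand, and that STO (R4), R0 reads both registers and writes only memory.

```python
from itertools import combinations


def independent(instrs):
    """True if no pair shares a true, anti, or output register dependency."""
    for (w1, r1), (w2, r2) in combinations(instrs, 2):
        if (w1 & r2) or (r1 & w2) or (w1 & w2):
            return False
    return True


# Each instruction is (registers written, registers read).
first_group = [
    ({"R1"}, {"R2"}),        # LOAD R1, R2  (assumed: R1 <- value addressed via R2)
    ({"R3"}, {"R3"}),        # ADD R3, 1    (R3 <- R3 + 1)
    ({"R4"}, {"R4", "R2"}),  # ADD R4, R2   (R4 <- R4 + R2)
]
second_group = [
    ({"R3"}, {"R3"}),        # ADD R3, 1
    ({"R4"}, {"R4", "R3"}),  # ADD R4, R3   needs the new R3
    (set(), {"R4", "R0"}),   # STO (R4), R0 needs the new R4 as an address
]

print(independent(first_group))   # True
print(independent(second_group))  # False
```

The first group shares R2 only as a source, which is harmless; the second group forms a chain in which each instruction needs the result of the one before it.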

14
Instruction Issue Policies

Order in which instructions are fetched
Order in which instructions are executed
Order in which instructions update registers and
memory values
Note: there is also the issue of instruction
completion policy

15
In-Order Issue -- In-Order Completion

Issue instructions in the order they occur
Not very efficient
Instructions must stall if necessary

16
In-Order Issue -- In-Order Completion (Example)

Assume
I1 requires 2 cycles to execute
I3 and I4 conflict for the same functional unit
I5 depends upon value produced by I4
I5 and I6 conflict for a functional unit

17
In-Order Issue -- Out-of-Order Completion (Example)

Again
I1 requires 2 cycles to execute
I3 and I4 conflict for the same functional unit
I5 depends upon value produced by I4
I5 and I6 conflict for a functional unit

How does this affect interrupts?

18
Out-of-Order Issue -- Out-of-Order Completion

Decouple decode pipeline from execution pipeline
Can continue to fetch and decode until this
pipeline is full
When a functional unit becomes available an
instruction can be executed
Since instructions have been decoded, processor
can look ahead

19
Out-of-Order Issue -- Out-of-Order Completion
(Example)

Again
I1 requires 2 cycles to execute
I3 and I4 conflict for the same functional unit
I5 depends upon value produced by I4
I5 and I6 conflict for a functional unit

Note: I5 depends upon I4, but I6 does not
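The look-ahead described for out-of-order issue can be sketched as a scan over the window of decoded instructions. This is an illustrative sketch only: `select_issue`, the unit name "alu", and the single-unit setup are assumptions, and the check covers only true dependencies and unit availability (anti and output dependencies are ignored here).

```python
def select_issue(window, completed, free_units):
    """Pick instructions that can issue this cycle, scanning oldest first.

    window:     list of (name, unit, deps) in program order
    completed:  names of instructions whose results are available
    free_units: mapping of unit name -> number of free units this cycle
    """
    free = dict(free_units)  # copy so the caller's counts are untouched
    issued = []
    for name, unit, deps in window:
        operands_ready = all(d in completed for d in deps)
        if operands_ready and free.get(unit, 0) > 0:
            free[unit] -= 1
            issued.append(name)
    return issued


# I5 depends on I4; I5 and I6 compete for the same (hypothetical) unit.
window = [("I5", "alu", {"I4"}), ("I6", "alu", set())]

print(select_issue(window, completed=set(), free_units={"alu": 1}))   # ['I6']
print(select_issue(window, completed={"I4"}, free_units={"alu": 1}))  # ['I5']
```

While I4 is still executing, I6 is issued ahead of I5. An in-order issue policy would instead stop scanning at I5 and issue nothing that cycle.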
20
Register Renaming

Output and antidependencies occur because
register contents may not reflect the correct
ordering from the program
Can result in a pipeline stall
One solution: allocate registers dynamically
(renaming registers)

21
Register Renaming example

R3b ← R3a + R5a (I1)
R4b ← R3b + 1 (I2)
R3c ← R5a + 1 (I3)
R7b ← R3c + R4b (I4)
Without subscript refers to logical register in
instruction
With subscript is hardware register allocated
R3a R3b R3c
Note: R3c avoids the antidependency on I2 and
the output dependency on I1
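The renaming in this example can be reproduced with a small table that maps each logical register to its current hardware copy. This is a minimal sketch assuming one letter suffix per allocation, as on the slide; the function name `rename` and the tuple encoding of instructions are hypothetical.

```python
SUFFIXES = "abcdefghijklmnopqrstuvwxyz"  # sketch only: at most 26 copies per register


def rename(program):
    """Rename logical registers to hardware registers.

    program: list of (dest, [sources]); sources are register names or literals.
    """
    version = {}  # logical register -> index of its current hardware copy

    def current(reg):
        return reg + SUFFIXES[version.setdefault(reg, 0)]

    renamed = []
    for dest, sources in program:
        # Sources use the copies that are current before this instruction writes.
        new_sources = [current(s) if s.startswith("R") else s for s in sources]
        # Every write allocates a fresh hardware register for the destination.
        version[dest] = version.get(dest, 0) + 1
        renamed.append((current(dest), new_sources))
    return renamed


program = [
    ("R3", ["R3", "R5"]),  # I1: R3 <- R3 + R5
    ("R4", ["R3", "1"]),   # I2: R4 <- R3 + 1
    ("R3", ["R5", "1"]),   # I3: R3 <- R5 + 1
    ("R7", ["R3", "R4"]),  # I4: R7 <- R3 + R4
]

for dest, sources in rename(program):
    print(dest, "<-", " + ".join(sources))
```

Because I3 writes a new copy (R3c), it no longer has to wait for I2 to read R3b or for I1 to finish writing it.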

22
Machine Parallelism Support

Duplication of Resources
Out of order issue
Renaming
Windowing

23
Speedups of Machine Organizations (Without
Procedural Dependencies)

Not worth duplication of functional units
without register renaming
Need instruction window large enough (more than
8, probably not more than 32)

24
Branch Prediction in Superscalar Machines

Delayed branch not used much. Why?
Multiple instructions need to execute in
the delay slot.
This leads to much complexity in
recovery.
Branch prediction may be used - Branch history
MAY still be useful
Are there any alternatives?
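As one illustration of prediction from branch history, the sketch below uses a two-bit saturating counter per branch address. The slides do not say which history scheme is intended, so this scheme, the class name `TwoBitPredictor`, the initial counter value, and the address 0x40 are all assumptions made for the example.

```python
class TwoBitPredictor:
    """Two-bit saturating counter per branch: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state 0..3

    def predict(self, addr):
        return self.counters.get(addr, 1) >= 2  # assumed start: weakly not taken

    def update(self, addr, taken):
        state = self.counters.get(addr, 1)
        self.counters[addr] = min(3, state + 1) if taken else max(0, state - 1)


def accuracy(outcomes, addr=0x40):
    """Fraction of outcomes predicted correctly for one branch."""
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(addr) == taken
        predictor.update(addr, taken)
    return hits / len(outcomes)


# A loop branch taken nine times, then falling through once.
print(accuracy([True] * 9 + [False]))  # 0.8
```

The two misses are the first iteration, before any history exists, and the loop exit.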

25
Superscalar Execution

26
Committing or Retiring Instructions

Results need to be put into order (commit or
retire)
Results sometimes must be held in temporary
storage until it is certain they can be placed in
permanent storage.
Temporary storage requires regular clean up -
overhead.
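The temporary storage described above is commonly organized as a queue kept in program order. The following is a simplified sketch, not a description of any particular processor: the class `ReorderBuffer`, its method names, and the dictionary used as permanent register state are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Hypothetical minimal buffer: results wait here until they can be retired."""

    def __init__(self):
        self.pending = deque()  # entries in program (issue) order
        self.registers = {}     # permanent storage

    def issue(self, name, dest):
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.pending.append(entry)
        return entry

    def complete(self, entry, value):
        # Execution may finish in any order; the result stays in the buffer.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        # Retire only from the head, so permanent state changes in program order.
        retired = []
        while self.pending and self.pending[0]["done"]:
            entry = self.pending.popleft()
            self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["name"])
        return retired


rob = ReorderBuffer()
i1 = rob.issue("I1", "R1")
i2 = rob.issue("I2", "R2")

rob.complete(i2, 7)    # I2 finishes first
first = rob.commit()   # nothing retires while I1 is outstanding
rob.complete(i1, 3)
second = rob.commit()  # both retire, in program order
print(first, second, rob.registers)
```

Holding I2's result until I1 retires is also what lets a processor discard work cleanly after an interrupt or a mispredicted branch.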

27
Superscalar Hardware Support

Facilities to simultaneously fetch multiple
instructions
Logic to determine true dependencies involving
register values and Mechanisms to communicate
these values
Mechanisms to initiate multiple instructions in
parallel
Resources for parallel execution of multiple
instructions
Mechanisms for committing process state in
correct order

28
Conclusions

What are the relative benefits of
Superscalar
Superpipelining

29
Superscalar CISC machines

Can Superscalar design be applied to CISC
machines?

30
javax.comm

Basically, javax.comm is no longer supported
on Windows (hasn't been since 2002), so we
switched to RxTx, which is nearly identical.
According to
http://en.wikibooks.org/wiki/Serial_Programming
Serial_JavaRxTx,
"Converting a JavaComm Application to RxTx", all
that is required to convert a javacomm
application to an RxTx application is simply
changing the import statement import
javax.comm.* to import gnu.io.*
Everything else in the program can remain
exactly the same because the package gnu.io
apparently encompasses the same classes as
javax.comm.
Indeed, the RxTx version of SimpleWrite is
identical to the javacomm version of SimpleWrite
except that it imports gnu.io.* rather than
javax.comm.*

31
Basic Concepts of the IA-64 Architecture

Instruction level parallelism
Explicit in machine instruction rather than
determined at run time by processor
Long or very long instruction words (LIW/VLIW)
Fetch bigger chunks already preprocessed
Branch predication (not the same as branch
prediction)
Go ahead and fetch and decode instructions, but
keep track of them so the decision to issue
them, or not, can be practically made later
Speculative loading
Go ahead and load data so it is ready when needed,
and have a practical way to recover if
speculation proved wrong
Software Pipelining
Allows multiple iterations of a loop to execute
in parallel
