Computer System Architecture Dependency and OOO Execution - PowerPoint PPT Presentation


About This Presentation



Provided by: SMI107


Transcript and Presenter's Notes

Title: Computer System Architecture Dependency and OOO Execution

1
Computer System Architecture: Dependency and OOO
Execution

Lynn Choi
School of Electrical Engineering

2
Instruction-Level Parallelism

ILP (Instruction-Level Parallelism)
The program characteristic that allows the
overlapped or parallel execution of instructions
Data dependences and control dependences limit
ILP
As long as instruction A is (data and control)
independent of instruction B, A and B can be
executed in parallel
Processors exploit ILP to improve performance
Two approaches
Hardware approach: rely on hardware to discover
and exploit the parallelism dynamically at
runtime
Pipelining: overlapping the execution of
instructions in different pipeline stages
Out-of-order execution
Superscalar processors
Software approach: rely on the compiler to find and
expose the parallelism at compile time
Code scheduling (local and global scheduling)
Loop unrolling, software pipelining, trace
scheduling
Changing the order of instructions to reduce the
number of stall cycles
VLIW processors

3
Three Forms of Data Dependence

True dependence (Read-After-Write, RAW)
Also called flow dependence
Requires a pipeline interlock
Data bypass (forwarding) can reduce the producer
latency
Makes values generated by FUs immediately
available to dependent instructions
Output dependence (Write-After-Write, WAW)
Anti-dependence (Write-After-Read, WAR)
Both are called false dependencies
They require a pipeline interlock or register renaming
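
The three forms above can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch, not part of the original slides; the `Instr` type, the `dependences` helper, and the sample instructions are hypothetical names chosen for illustration.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Return the data dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # true (flow) dependence
    if first.dest == second.dest:
        found.add("WAW")    # output dependence
    if second.dest in first.srcs:
        found.add("WAR")    # anti-dependence
    return found


# Illustrative instructions (assumed, not from the slides)
i1 = Instr("MUL", "R1", ("R2", "R3"))
i2 = Instr("ADD", "R4", ("R1", "R5"))
i3 = Instr("SUB", "R1", ("R6", "R7"))

print(dependences(i1, i2))  # i2 reads R1 written by i1
print(dependences(i1, i3))  # i1 and i3 both write R1
print(dependences(i2, i3))  # i3 overwrites R1, which i2 reads
```

Only the first case carries a value from producer to consumer; the other two arise from reusing the register name R1, which is why renaming can remove them.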

4
Control Dependence

A control dependence determines the ordering of
an instruction i with respect to a branch
instruction
Every instruction, except for those in the first
basic block of the program, is control dependent
on some set of branches
Example
if p1 { S1 }
if p2 { S2 }
S1 is control dependent on p1, and S2 is control
dependent on p2 but not on p1.
Control dependence imposes the following two
constraints
An instruction that is control dependent on a
branch cannot be moved before the branch, so that
its execution is no longer controlled by the branch
An instruction that is not control dependent on a
branch cannot be moved after the branch, so that
its execution becomes controlled by the branch

5
In-Order Pipeline

In-order issue
If an instruction is stalled in the pipeline, no
later instructions can proceed. However, once
issued to FUs, an instruction in general need
not stall.
Instructions can complete out of order
Dependency resolution mechanisms
Pipeline interlock
Needs reg-id comparators between the sources and
destinations of instructions in the REG stage and the
destinations of instructions in the EXE and WRB
stages
Comparators are needed for both interlock and bypass
Scoreboard
A busy bit for each register
For long-latency operations such as MEM
operations
Instead of using comparators, you check the
scoreboard for operand availability
Comparators are still needed for bypass!
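
The busy-bit scoreboard described above can be sketched as follows. This is an illustrative model only; the class name, method names, and register numbering are assumptions, not taken from the slides.

```python
class BusyBitScoreboard:
    """One busy bit per register: set at issue, cleared at write-back."""

    def __init__(self, num_regs: int) -> None:
        self.busy = [False] * num_regs

    def operands_ready(self, srcs) -> bool:
        # An operand is available when no earlier instruction
        # still has a pending write to that register.
        return not any(self.busy[r] for r in srcs)

    def issue(self, dest: int) -> None:
        self.busy[dest] = True      # a result for `dest` is now pending

    def writeback(self, dest: int) -> None:
        self.busy[dest] = False     # the result has been written


sb = BusyBitScoreboard(num_regs=8)
sb.issue(dest=1)                          # e.g. a long-latency load into r1
stalled = not sb.operands_ready((1, 3))   # a consumer of r1 must wait
sb.writeback(dest=1)
ready = sb.operands_ready((1, 3))         # the consumer can now proceed
print(stalled, ready)
```

A single bit lookup replaces the register-id comparison for the interlock, but it says nothing about where the value currently sits in the pipeline, which is why bypassing still needs comparators.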

6
Example

FET-DEC-REG-EXE-WRB
What kind of dependence violations are possible?
Single-issue, 5-stage in-order pipeline with the
following pipelined FUs
2 INT units (1-cycle INT operation)
1 FP unit (4-cycle FP operation)
2 MEM pipelines (2-cycle MEM operation)
How many comparators do you need for the previous
example?
RAW
2 srcs × 2 stages (E, W) × 2 INT = 8
2 srcs × 2 stages (E4, W) × 1 FP = 4
2 srcs × 2 stages (E2, W) × 2 MEM = 8
WAW
1 dest × 2 stages (E, W) × 2 INT = 4
1 dest × 2 stages (E4, W) × 1 FP = 2
1 dest × 2 stages (E2, W) × 2 MEM = 4
WAW hazards can happen only in the MEM and FP
pipelines; therefore, you can remove the 4 WAW
comparators for the INT pipelines.
How many more comparators are needed for a 2-issue
in-order superscalar pipeline?
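
The comparator counts above follow one product per row: registers compared × stages checked × number of pipelines. The sketch below reproduces that arithmetic; the function and variable names are assumptions, and the totals are derived from the per-row figures rather than stated on the slide.

```python
def comparators(regs_compared: int, stages: int, pipelines: int) -> int:
    """Comparators = registers compared x stages checked x pipelines."""
    return regs_compared * stages * pipelines


UNITS = {"INT": 2, "FP": 1, "MEM": 2}   # number of pipelines per FU type
STAGES = 2                              # last execute stage and WRB

raw = {fu: comparators(2, STAGES, n) for fu, n in UNITS.items()}  # 2 sources
waw = {fu: comparators(1, STAGES, n) for fu, n in UNITS.items()}  # 1 destination

raw_total = sum(raw.values())
waw_total = sum(waw.values())
# WAW cannot occur in the 1-cycle INT pipelines, so those can be dropped.
waw_needed = waw_total - waw["INT"]

print(raw, raw_total)
print(waw, waw_total, waw_needed)
print(raw_total + waw_needed)
```

Under these counts the single-issue pipeline needs 20 RAW comparators and 6 WAW comparators, 26 in total.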

7
Out-Of-Order Machines

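Anti-dependence (WAR) hazards can happen in OOO machines
DIV F0, F2, F4
ADD F10, F0, F8
SUB F8, F8, F14
ADD waits for F0 from the long-latency DIV; SUB
must not write F8 before ADD has read it
Different approaches
Scoreboarding
Tomasulo's algorithm
Register Update Unit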
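
To see why the anti-dependence matters, the three instructions above can be run in program order and in an order where SUB finishes first. This is a simplified sketch, assuming each instruction reads its sources and writes its result at the moment it executes; the initial register values and the `execute` helper are made up for illustration.

```python
import operator

OPS = {"DIV": operator.truediv, "ADD": operator.add, "SUB": operator.sub}

# (op, dest, src1, src2), the example from the slide
PROGRAM = [
    ("DIV", "F0", "F2", "F4"),
    ("ADD", "F10", "F0", "F8"),
    ("SUB", "F8", "F8", "F14"),
]


def execute(order, initial):
    """Execute PROGRAM in the given order, reading and writing registers directly."""
    regs = dict(initial)
    for index in order:
        op, dest, src1, src2 = PROGRAM[index]
        regs[dest] = OPS[op](regs[src1], regs[src2])
    return regs


# Arbitrary initial values (assumed)
initial = {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F8": 10.0, "F10": 0.0, "F14": 3.0}

in_order = execute([0, 1, 2], initial)
# SUB is independent of DIV, so an OOO machine could finish it first
# while ADD is still waiting for F0.
sub_first = execute([2, 0, 1], initial)

print(in_order["F10"], sub_first["F10"])
```

With these values, ADD produces 14.0 in program order but 11.0 when SUB has already overwritten F8, so the hardware must either delay the write or rename F8.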

8
Scoreboarding - CDC 6600 -

Scoreboard
One bit per register indicates whether or not
there is a pending update
Pipeline stalls on WAW and WAR dependences
FET-DEC/ISS-REG-EXE-WRB
ISS stage: checks for WAW and structural hazards
Pipeline stalls on output dependence
Allows only 1 pending update per register
REG stage
Resolves RAW hazards
Instructions are sent to FUs out of order
WRB stage
Once execution completes, checks for WAR
hazards
Instruction buffers
Instruction buffer between FET and DEC/ISS stages
Can be omitted
(Centralized) instruction window between ISS and
REG stages
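
The per-stage checks listed above can be written as three predicates. This is a rough sketch of the checks only, not of the CDC 6600 hardware; the set-based bookkeeping (`pending_writes`, `busy_fus`, `pending_reads`) and function names are assumptions for illustration.

```python
def can_issue(dest, fu, pending_writes, busy_fus):
    """ISS stage: stall on a WAW hazard or a structural hazard."""
    return dest not in pending_writes and fu not in busy_fus


def can_read_operands(srcs, pending_writes):
    """REG stage: wait until no source has a pending update (RAW)."""
    return not any(src in pending_writes for src in srcs)


def can_write_back(dest, pending_reads):
    """WRB stage: wait while an earlier instruction still has to read `dest` (WAR)."""
    return dest not in pending_reads


# State after DIV F0,F2,F4 and ADD F10,F0,F8 have been issued:
pending_writes = {"F0", "F10"}   # results not yet written
busy_fus = {"DIV", "ADD"}        # functional units in use
pending_reads = {"F8"}           # ADD has not read F8 yet (it waits for F0)

# SUB F8,F8,F14 on a free SUB unit:
sub_issues = can_issue("F8", "SUB", pending_writes, busy_fus)
sub_reads = can_read_operands(("F8", "F14"), pending_writes)
sub_writes = can_write_back("F8", pending_reads)
print(sub_issues, sub_reads, sub_writes)
```

In this state SUB can issue and execute ahead of ADD, but its write-back is held until ADD has read F8.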

9
Tomasulo's Algorithm - Reservation Station

Used in the IBM 360/91 floating-point unit (1967)
Three ideas
OOO execution using reservation stations (RS)
Distributed instruction windows
Register renaming to remove anti and output
dependencies
Read available input operands from RF and store
them into RS (WAR removal)
Assign new storage for output (WAW removal)
Pipeline does not stall on WAW and WAR hazards
Data forwarding using a common data bus
Bypasses the data directly to the waiting
instructions in RS
Both the register file and RS (source and dest)
monitor the result bus and update data when a
matching tag is found
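
The tag-matching behaviour of the common data bus can be sketched as below. This is a simplified model under stated assumptions: tags are reservation-station names, `reg_status` maps a register to the tag of its pending producer, and all class and function names are hypothetical rather than taken from the IBM 360/91 design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Operand:
    value: Optional[float] = None   # captured value, once available
    tag: Optional[str] = None       # RS that will produce it, if still pending

    @property
    def ready(self) -> bool:
        return self.tag is None


@dataclass
class RSEntry:
    name: str
    op: str
    src1: Operand
    src2: Operand

    def ready(self) -> bool:
        return self.src1.ready and self.src2.ready


def broadcast(tag: str, value: float, stations: List[RSEntry],
              reg_status: Dict[str, str], reg_file: Dict[str, float]) -> None:
    """Common data bus: everything waiting on `tag` captures `value`."""
    for rs in stations:
        for operand in (rs.src1, rs.src2):
            if operand.tag == tag:
                operand.value, operand.tag = value, None
    for reg, pending in list(reg_status.items()):
        if pending == tag:
            reg_file[reg] = value
            del reg_status[reg]


reg_file = {"F8": 10.0}
reg_status = {"F0": "DIV1"}     # F0 will be produced by station DIV1
# ADD F10,F0,F8: F8 was read from RF at issue, F0 is awaited by tag
add1 = RSEntry("ADD1", "ADD", Operand(tag="DIV1"), Operand(value=10.0))

before = add1.ready()
broadcast("DIV1", 4.0, [add1], reg_status, reg_file)
after = add1.ready()
print(before, after, add1.src1.value, reg_file["F0"])
```

Because ADD1 copied the value of F8 into its reservation station at issue, a later write to F8 can no longer disturb it, which is how WAR is removed.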

10
Tomasulo's Algorithm

FET-DEC/REN/ISS-REG-EXE-WRB-COM
REN/ISS stage: checks for structural hazards
(a free reservation station entry), reads available
operands from the register file (register renaming
for WAR), and assigns an RS entry for the destination
(WAW hazard)
REG stage: monitors the common data bus and reads
operands into RS if there is a tag match; determines
the highest-priority operations among ready
operations (wakeup)
EXE stage: executes and forwards the result to RS and RF
Instruction buffers
Instruction queue between FET and DEC/ISS stages
Can be omitted
Reservation station between ISS and REG stages
Reorder buffer between WRB and COM stages
Not in the original proposal (IBM 360/91)

11
Renaming

Removes anti and output dependencies
Allows more than one pending update
Several forms of renaming
Tomasulo's algorithm
Reservation stations provide additional storage for
name dependencies, and the common data bus provides
data bypass
Reorder buffer with associative lookup
Associative lookup maps the reg id to the reorder
buffer entry as soon as an entry is allocated
Register map table with separate physical
register file
Register map table (DEC 21264)
Register alias table (Intel P6)

12
Renaming

Assign one physical register to every
instruction with a destination register
With 80 instructions in flight (reorder buffer
size)
You need roughly 80 physical registers (branches
and stores do not need one)
Physical registers are single-assignment
registers
Register renaming involves data dependence
checking among the instructions that are
simultaneously being renamed
Renaming bandwidth is limited by
Data dependence checking
Number of read ports needed for the register map
table
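
Renaming with a register map table and a free list can be sketched as follows, using the DIV/ADD/SUB sequence from the out-of-order slide. This is a minimal illustration; the physical register names (P0, P1, ...), the initial mapping, and the `rename` helper are assumptions, and freeing of physical registers at commit is left out.

```python
from collections import deque


def rename(program, map_table, free_list):
    """Rename (op, dest, src1, src2) tuples in program order."""
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources use the current mapping, so they see destinations
        # renamed by earlier instructions in the same group.
        phys1, phys2 = map_table[src1], map_table[src2]
        phys_dest = free_list.popleft()   # fresh single-assignment register
        map_table[dest] = phys_dest
        renamed.append((op, phys_dest, phys1, phys2))
    return renamed


program = [
    ("DIV", "F0", "F2", "F4"),
    ("ADD", "F10", "F0", "F8"),
    ("SUB", "F8", "F8", "F14"),
]
arch_regs = ["F0", "F2", "F4", "F8", "F10", "F14"]
map_table = {reg: f"P{i}" for i, reg in enumerate(arch_regs)}   # F0->P0, ...
free_list = deque(f"P{i}" for i in range(len(arch_regs), len(arch_regs) + 3))

renamed = rename(program, map_table, free_list)
for instr in renamed:
    print(instr)
```

After renaming, ADD reads P3 (the old F8) while SUB writes P8, so the WAR hazard on F8 is gone; the RAW dependence on F0 survives as P6. Reading the map for sources before updating it for the destination is the sequential equivalent of the dependence check among simultaneously renamed instructions.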

13
Renaming
14
Rename Example (P6)
15
Rename Example (P6)
16
Rename Example (P6)
17
Rename Example (P6)
18
PowerPC 620 - OOO example -
19
DEC 21264 - OOO example -
20
DEC 21264 - OOO example -
21
Intel P6 - OOO example -
22
Exercises and Discussion

There can be many instruction buffers in an OOO
processor. Name those buffers and explain their
functions.
What happens on a branch misprediction in OOO
processors?
