Superscalar Techniques

Slides: 105

Provided by: sur78


1
Chapter 5

Superscalar Techniques

2
3 major issues

Instruction flow
Register data flow
Memory data flow
Major challenges
Branch instruction processing
ALU instructions
Load/store instructions

3
Instruction Flow

Control dependences
Control Flow Graph (CFG)
Nodes - instructions
Edges - control flow
Branches
Jumps
Calls, Returns

4
Program Control Flow
5
Performance Degradation due to branches

In a scalar processor, the penalty may be 3 cycles
In an N-wide superscalar processor, the penalty is
equivalent to 3N cycles (3N lost instruction slots)
Conditional branches - 2 issues
Condition resolution
Target address generation

6
Branch Penalty

Unconditional branches
Only target needs to be resolved
But the addressing mode of the branch matters
PC relative
Indirect branches

7
Disruption of Sequential Control Flow by Branch
Instructions

3 cycle penalty for
conditional branches
(D, Dis, and Ex stages)
If 4-way superscalar,
penalty equivalent to
4 x 3 = 12 cycles
Conditional branches -
bottleneck could be
condition resolution or
target address calculation

8
Branch Target Address Generation Penalties

Target address calculation penalty varies for
different types of branches
PC relative - 1 cycle
Register indirect - 2 cycles
Register indirect with offset - 3 cycles

9
Branch Condition Resolution Penalties

Different for different types of branches
If CC (condition code) is used - 2 cycles
If GP registers have to be compared - 3 cycles

10
Static Branch Prediction

No hardware to predict branch at run time
Always Not Taken
Backwards branch always taken
Compiler assisted branch prediction
Depending on opcode
Calls
Returns
Loops

11
Dynamic Branch Prediction

Static branch prediction achieves 70-80% accuracy
Dynamic branch predictors achieve 80-95%
accuracy in branch direction prediction
Problems with static prediction
A branch which is taken during first half of
program and not taken during second half

12
FSM Model

History based FSMs

13
Branch Prediction Techniques

Branch target speculation
Branch condition speculation
Branch target speculation
BTB (branch target buffer)
Branch instruction address (BIA)
Branch target address (BTA)
BTB is like a cache
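
The "BTB is like a cache" point can be sketched in a few lines of Python. This is only an illustration: the class name, the direct-mapped organization, the 64-entry default and the 4-byte instruction size are assumptions, not details from the slides. Each entry pairs a branch instruction address (BIA), used as the tag, with a branch target address (BTA).

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds a (BIA, BTA) pair."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # None marks an invalid entry

    def _index(self, addr):
        # Assumption: 4-byte instructions, so the 2 low bits carry no information.
        return (addr >> 2) % self.num_entries

    def lookup(self, fetch_addr):
        """Return the predicted target on a hit, or None on a miss."""
        entry = self.entries[self._index(fetch_addr)]
        if entry is not None and entry[0] == fetch_addr:  # tag compare on the BIA
            return entry[1]
        return None

    def update(self, bia, bta):
        """Record a branch and its target once the target is known."""
        self.entries[self._index(bia)] = (bia, bta)


btb = BranchTargetBuffer(num_entries=4)
btb.update(0x100, 0x200)
print(btb.lookup(0x100))  # hit: the stored target
print(btb.lookup(0x104))  # miss: None
```

As in a cache, two branches that map to the same entry evict each other, which is one reason real designs may add associativity.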

14
Branch Target Speculation
15
Branch Condition Speculation

Hardware to predict Taken/Not-Taken
Always predict Not Taken
50% of branches are taken (T)
Is predicting taken equivalent to predicting NT?
No, because taken branches have an
additional penalty to start fetching instructions
from the new target

16
Branch Condition Speculation

Let us say 3 cycles to fill pipe with
instructions from the new target
Always not taken - no stalls per correct
prediction
Always taken - 3 stalls per correct prediction
Assume 4 cycles per wrong prediction
Assuming 20% branches,
CPI 1 = 1 + .05 x 4
CPI 2 = 1 + .05 x 4 + .05 x 3
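
The two CPI figures can be checked with a short calculation. This sketch reads each CPI line as a base CPI of 1 plus terms of the form (fraction of instructions) x (stall cycles). That reading is an assumption, since the operators are not legible in this transcript, and the 0.05 fractions are simply the figures that appear on the slide. The function name `cpi` is hypothetical.

```python
def cpi(base_cpi, events):
    """events: iterable of (fraction_of_instructions, stall_cycles) pairs."""
    return base_cpi + sum(frac * stalls for frac, stalls in events)


# Always not taken: only wrong predictions cost cycles (4 each).
cpi_not_taken = cpi(1.0, [(0.05, 4)])

# Always taken: wrong predictions cost 4 cycles, and correctly predicted
# taken branches still pay the 3-cycle refill from the new target.
cpi_taken = cpi(1.0, [(0.05, 4), (0.05, 3)])

print(round(cpi_not_taken, 2), round(cpi_taken, 2))  # 1.2 1.35
```

Under these assumed numbers, predicting taken costs more than predicting not taken even at the same accuracy, which is the point of the preceding slide.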

17
History-based Branch Prediction

2-bit branch predictor
If count is 2 or 3 predict taken
If count is 0 or 1 predict not taken
A 2-bit counter for each branch?
An array of 2-bit counters
How to access this array?
How to index into it?
Organized like a cache
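
A minimal sketch of the array of 2-bit counters, to make the indexing question concrete. It assumes an untagged table indexed by the low bits of the branch address, 4-byte instructions, and counters that start at 1; none of these choices come from the slides, and the class name is hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0, 1 = not taken; 2, 3 = taken."""

    def __init__(self, num_counters=512):
        self.num_counters = num_counters
        self.counters = [1] * num_counters  # arbitrary initial state

    def _index(self, pc):
        # Assumption: drop the 2 low bits, then use the low-order bits as index.
        return (pc >> 2) % self.num_counters

    def predict(self, pc):
        """True means predict taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = TwoBitPredictor(num_counters=16)
mispredictions = 0
for outcome in [True] * 9 + [False] + [True] * 9 + [False]:  # a 10-iteration loop, run twice
    if predictor.predict(0x40) != outcome:
        mispredictions += 1
    predictor.update(0x40, outcome)
print(mispredictions)
```

Because the table has no tags, two branches that share an index also share a counter; a cache-like organization with tags avoids that at the cost of extra storage.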

18
History-based Branch Prediction

2-bit branch predictor

19
Optimal 2-bit Branch Predictors
Counter - a 2-bit saturating counter
20
Two Aspects of Branch Prediction (a) Branch
Speculation (b) Branch Validation/Recovery
21
Branch Prediction in PowerPC 604

Instead of a single unified BTB
BTAC - 64 entries, FA (fully associative)
BHT - 512 entries, DM (direct mapped)
FA (Fetch Address) sent to both
BTAC - 1 cycle, BHT - 2 cycles
22
PowerPC 604 contd

BTAC - 64 entries, FA
BHT - 512 entries, DM
BTAC - 1 cycle, BHT - 2 cycles
4 entries in Branch Reservation Station
Up to 4 speculative branch instructions
2-bit tag used to identify speculative instructions
After a branch resolves, speculative instructions are
made non-speculative or invalidated.
23
Two-level Branch Prediction of Yeh and Patt
24
Correlated Branch Predictor with Global BHSR
25
Correlated Branch Predictor with Individual BHSRs
26
gshare Correlated Branch Predictor

Scott McFarling
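
A small sketch of the gshare idea: the branch address is XORed with a global branch history shift register (BHSR) to index a single table of 2-bit counters (PHT). The class name, the history length and the initial counter value are illustrative assumptions.

```python
class Gshare:
    """gshare sketch: PHT index = (branch address) XOR (global history)."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.bhsr = 0                          # global branch history shift register
        self.pht = [1] * (1 << history_bits)   # 2-bit saturating counters

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the 2 low address bits are dropped.
        return ((pc >> 2) ^ self.bhsr) & self.mask

    def predict(self, pc):
        """True means predict taken."""
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter that was used, then shift the outcome into the history."""
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.bhsr = ((self.bhsr << 1) | int(taken)) & self.mask


g = Gshare(history_bits=4)
correct = []
for i in range(50):
    taken = (i % 2 == 0)          # strictly alternating branch
    correct.append(g.predict(0x40) == taken)
    g.update(0x40, taken)
print(all(correct[-20:]))         # True once the pattern has been learned
```

A single 2-bit counter cannot follow an alternating branch, but with history in the index the two phases of the pattern land on different counters.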

27
Two-level predictors (contd)

BHSR: G - global, P - individual
PHT: g - global, p - individual, s - shared
A - Adaptive
3 predictors with 97% prediction accuracy
GAg - 1 BHSR (18 bits), 1 PHT of 2^18 x 2 bits
PAg - 512 x 4-way SA BHSRs of 12 bits, 1 PHT of
2^12 x 2 bits
PAs - 512 x 4-way BHSRs of 6 bits, 512 PHTs of 2^6
x 2 bits (PHT is DM)

28
Two-level predictors (contd)

2-level predictors give 95% accuracy
Traditional predictors give 90%
Processors starting with PentiumPro and AMD Nx686
onwards use 2-level predictors

29
3 major issues

Instruction flow
Register data flow
Memory data flow
Major challenges
Branch instruction processing
ALU instructions
Load/store instructions

30
Register Data Flow Techniques

Efficient execution of ALU type instructions in
the execution core
Real work
Memory and control flow instructions viewed as supporting
Memory instructions provide data
Control flow instructions provide the right
instructions
Concept of useful and overhead instructions

31
ALU Instructions

Ri <- Fn(Rj, Rk)
Source registers Rj, Rk
Destination register Ri
Operation Fn
If source operands in Rj or Rk are not available
- true data dependency
If destination register Ri not available, anti or
output dependence

32
Register Reuse and False Dependencies

Anti and output dependence are due to register
reuse
Called false dependencies
Register reuse is also called register recycling
Compiler performs code generation and register
allocation

33
Register Allocation

Single assignment code - code in which each
symbolic register is used to store one value and is
written only once
Practical ISAs have limited number of registers
Register coloring algorithm
Register live-range

34
Register Renaming

Dynamically assign different names to the
multiple definitions of an architected register
r4 <- r3 r2        r4 <- r3 r2
r6 <- r4 r5        r6 <- r4 r5
r4 <- r6 r7        r8 <- r6 r7
(a)                (b)
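
The renaming step can be sketched with a mapping table and a list of free names. Unlike column (b) on the slide, which renames only the second definition of r4, this sketch gives every destination a fresh name; that is a simplification of mine. The function name and the physical register names p1..p3 are hypothetical, and the operation performed by each instruction is ignored because only register names matter here.

```python
def rename(instructions, free_regs):
    """instructions: list of (dest, src1, src2) architected register names."""
    mapping = {}            # architected name -> current physical name
    free = list(free_regs)  # names available for renaming
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the latest mapping, so true (RAW) dependences are preserved.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # Each definition gets a fresh name, which removes WAR and WAW dependences.
        new_dest = free.pop(0)
        mapping[dest] = new_dest
        renamed.append((new_dest, s1, s2))
    return renamed


code = [("r4", "r3", "r2"), ("r6", "r4", "r5"), ("r4", "r6", "r7")]
for instr in rename(code, ["p1", "p2", "p3"]):
    print(instr)
# ('p1', 'r3', 'r2')
# ('p2', 'p1', 'r5')
# ('p3', 'p2', 'r7')
```

The two definitions of r4 end up in different registers (p1 and p3), so the third instruction no longer has to wait for the second one to read r4.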

35
Register Renaming

Single assignment is effectively done for
instructions that are in flight
Eliminate all false dependences (anti and output)
between the instructions in flight
Common techniques
Rename Register File (RRF), Architected Register
File (ARF) and mapping table
Reorder Buffer (ROB)

36
Rename Register File (RRF)
37
Register Renaming Tasks

Source Read
For operand fetch - occurs during decode/dispatch
stage
Destination allocate
During decode; subtasks are: set busy bit, assign
tag, update map table
Register update
At the end of execution update RRF first and
then ARF

38
Source Read - 3 possibilities

Busy bit in ARF not set
ARF contains the operand
Busy bit in ARF set
There is a pending write to the ARF register;
content of ARF is stale; map table used to get
RRF tag to index into RRF.
Valid bit of RRF set - source operand is in RRF
Valid bit of RRF not set - RRF has a pending
update; tag forwarded to reservation station
instead of source operand. RS will get data
later by forwarding.
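
The three source-read cases can be written as one small decision function. The dictionaries standing in for the ARF, RRF, map table and their busy/valid bits are illustrative assumptions, as is the function name.

```python
def source_read(reg, arf, arf_busy, map_table, rrf, rrf_valid):
    """Return ("value", v) if the operand is available, else ("tag", t)."""
    if not arf_busy[reg]:
        return ("value", arf[reg])   # case 1: ARF holds the operand
    tag = map_table[reg]             # ARF is stale, follow the map table
    if rrf_valid[tag]:
        return ("value", rrf[tag])   # case 2: result is already in the RRF
    return ("tag", tag)              # case 3: RS waits for the forwarded result


arf = {1: 10, 2: 20, 3: 30}
arf_busy = {1: False, 2: True, 3: True}
map_table = {2: 0, 3: 1}             # architected register -> RRF tag
rrf = {0: 99, 1: None}
rrf_valid = {0: True, 1: False}

for r in (1, 2, 3):
    print(r, source_read(r, arf, arf_busy, map_table, rrf, rrf_valid))
# 1 ('value', 10)
# 2 ('value', 99)
# 3 ('tag', 1)
```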

39
Register Renaming Tasks
40
Integrating RRF and ARF

Although the discussion so far assumed separate RRF
and ARF, that is not necessary
A single register file with number of entries
equal to RRF + ARF is sufficient
Pooled register file
Each physical register can be flexibly assigned
to be AR (architected register) or RR (rename
register)
With a pooled register file there is no need to
copy the result for the final update

41
Floating-Point Unit (FPU) Register Renaming

Example for Pooled Register File

42
Pooled Register File in IBM RS/6000

40 physical registers, 32 architected or logical
registers
Mapping table contains 32 entries of 6 bits each
6 bits specify the physical register
Rename pipe stage contains map table, two
circular queues and control logic
Map table must have 4 ports due to fused multiply
add, which needs 3 source registers and a destination

43
Pooled Register File in IBM RS/6000

First queue - Free list (FL)
Second queue - Pending target return queue (PTRQ)
FL contains registers available for renaming
PTRQ contains registers already in use for
renaming
Figure shows initial condition with PTRQ empty

44
Pooled Register File in IBM RS/6000

Map table contains the latest mapping of each
logical register.
When a new instruction needs the register as
destination, current entry of map table is pushed
into PTRQ
Subsequent instructions that need this register
as source will receive the new physical register
specifier as the source
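
A rough model of the map table, free list and PTRQ as described on these slides, with 32 logical and 40 physical registers. The class and method names are hypothetical. The slides do not say when a PTRQ entry goes back to the free list, so `release_oldest` is an assumption and is marked as such in the code.

```python
from collections import deque


class PooledRenamer:
    def __init__(self, num_logical=32, num_physical=40):
        self.map_table = list(range(num_logical))                 # logical i -> physical i
        self.free_list = deque(range(num_logical, num_physical))  # FL
        self.ptrq = deque()                                       # initially empty

    def rename_source(self, logical):
        """Sources always read the latest mapping."""
        return self.map_table[logical]

    def rename_dest(self, logical):
        """Give a destination a new physical register from the free list."""
        if not self.free_list:
            raise RuntimeError("no free physical register: dispatch must stall")
        self.ptrq.append(self.map_table[logical])   # old mapping is pushed into the PTRQ
        self.map_table[logical] = self.free_list.popleft()
        return self.map_table[logical]

    def release_oldest(self):
        # Assumption: the oldest PTRQ entry returns to the free list once no
        # in-flight instruction still needs it. The slides do not give the rule.
        self.free_list.append(self.ptrq.popleft())


r = PooledRenamer()
print(r.rename_source(4))   # 4: initial identity mapping
print(r.rename_dest(4))     # 32: first register on the free list
print(r.rename_source(4))   # 32: later readers see the new mapping
print(list(r.ptrq))         # [4]
```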

45
True Data Dependencies and the Data flow limit

RAW dependencies cannot be eliminated by renaming
Producer consumer relationship between 2
instructions
Imposes serialization between 2 dependent
instructions
Data dependence graph (DDG) used to represent
such true dependences

46
FFT Code Fragment
47
Data Flow Graph of the Code Fragment
48
DDG

Instructions are nodes
Edges are dependences
Edges can be marked with latencies
Critical path of a DFG - the longest dependence chain,
measured in terms of total cumulative latency
Data flow limit
12 cycles for FFT example
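
The data flow limit is the length of the longest latency-weighted path through the dependence graph. The sketch below computes it for a small made-up graph; the FFT fragment itself is not reproduced in this transcript, so the instruction names and latencies here are purely illustrative, and the function name is hypothetical.

```python
def critical_path(latency, edges):
    """latency: {instr: cycles}; edges: list of (producer, consumer) pairs."""
    preds = {n: [] for n in latency}
    for producer, consumer in edges:
        preds[consumer].append(producer)

    finish = {}  # memo: earliest finish time of each instruction

    def finish_time(n):
        if n not in finish:
            # An instruction starts once its slowest producer has finished.
            start = max((finish_time(p) for p in preds[n]), default=0)
            finish[n] = start + latency[n]
        return finish[n]

    return max(finish_time(n) for n in latency)


# Hypothetical 4-instruction graph: a feeds b and c, which both feed d.
latency = {"a": 2, "b": 3, "c": 2, "d": 3}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(critical_path(latency, edges))  # 8, along a -> b -> d
```

No amount of issue width can finish this graph in fewer than 8 cycles, which is what "data flow limit" means for the FFT example as well.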

49
Tomasulo's Algorithm

IBM 360/91 FP unit had dynamic scheduling
Tomasulo 1967
Several contemporary superscalar out of order
processors draw a lot of ideas from IBM 360/91

50
Original Design of IBM 360 FPU
51
Modified Design of IBM 360 FPU
52
Use of Tag Fields
53
Example Instruction sequence

W: R4 <- R0 R8
X: R2 <- R0 R4
Y: R4 <- R4 R8
Z: R8 <- R4 R2

54
Tomasulo's Algorithm

W: R4 <- R0 R8
X: R2 <- R0 R4
Y: R4 <- R4 R8
Z: R8 <- R4 R2
R0 = 6.0
R8 = 7.8
W finishes @2
Result broadcast, not written in R4
55
Tomasulo's Algorithm

W: R4 <- R0 R8
X: R2 <- R0 R4
Y: R4 <- R4 R8
Z: R8 <- R4 R2
R0 = 6.0
R8 = 7.8
@4 Y finishes, updates R4
@5 X finishes, updates R2
Z starts @6
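
The timings on these two slides can be reproduced with a small dataflow model. The operators of W, X, Y and Z are not legible in this transcript, so the latencies below (2 cycles for W and Y, 3 cycles for X and Z) are assumptions chosen to be consistent with the finish times shown. The model also assumes no structural hazards and that a consumer starts the cycle after its last producer finishes. The function name is hypothetical.

```python
def dataflow_schedule(program, latency):
    """program: list of (name, dest, src1, src2); returns {name: (start, finish)}."""
    producer = {}   # register -> latest instruction that writes it (acts as the tag)
    times = {}
    for name, dest, src1, src2 in program:
        # Operands with no in-flight producer come from the register file (ready at 0).
        ready = [times[producer[s]][1] for s in (src1, src2) if s in producer]
        start = max(ready, default=0) + 1
        times[name] = (start, start + latency[name] - 1)
        producer[dest] = name   # later readers of dest wait on this instruction
    return times


program = [("W", "R4", "R0", "R8"), ("X", "R2", "R0", "R4"),
           ("Y", "R4", "R4", "R8"), ("Z", "R8", "R4", "R2")]
latency = {"W": 2, "X": 3, "Y": 2, "Z": 3}   # assumed latencies, see above
print(dataflow_schedule(program, latency))
# {'W': (1, 2), 'X': (3, 5), 'Y': (3, 4), 'Z': (6, 8)}
```

Note that Z waits on Y for R4, not on W: once Y is dispatched, R4 carries Y's tag, which is why W's result is broadcast but never written into R4.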
56
Data Flow Graphs of Example Instruction sequence

W: R4 <- R0 R8
X: R2 <- R0 R4
Y: R4 <- R4 R8
Z: R8 <- R4 R2
57
Dynamic Execution Core

Out of order execution core also called dynamic
execution core
Tries to achieve data flow limit
3 steps in dynamic execution
Instruction dispatching
Instruction execution
Instruction completion

58
Instruction Dispatch Phase

Rename destination registers
Allocate reservation station and ROB entries
Advance instructions from the dispatch buffer to
the reservation stations (RS)
ROB entries allocated in program order
Rename register, RS entry, and ROB entry must be
available to be able to dispatch an instruction

59
Instruction Execution Phase

Issue Ready Instructions
Execute issued instructions
Forward results
RS is responsible for identifying ready
instructions
Ready means all source operands are available
Waiting instructions continually monitor the buses for
operands

60
Micro-Dataflow Engine for dynamic Execution
61
Instruction Execution Phase contd

Monitor the buses for operands using tags
Result buses come with results and tags
When tag matches, result captured from result bus
into RS entry
Result is also going into register file
When all operands available, instruction ready
for issue
If multiple instructions are ready, a scheduling
algorithm (e.g. oldest first, most critical first) picks one

62
Instruction Execution Phase contd

FUs have varying latencies
Single cycle latency FU
Multi-cycle (fixed) latency FU
Multi-cycle variable latency
When instruction finishes, destination tag and
result broadcast on result bus
Result bus also called forwarding bus
Tag is specifier of the rename register assigned
for destination of this instruction

63
Instruction Execution Phase contd

When instruction finishes, destination tag and
result broadcast on result bus
All dependent instructions waiting in the RS
will trigger a tag match and will latch in the
broadcast result
This forwarding does not require writing the
result into a register and reading it from there
The destination tag is also used to update RRF

64
Instruction Execution Phase contd

RS entry usually deallocated when instruction
issued
Another trailing instruction can now be
dispatched into the RS
RS saturation can cause instruction stalls
RS helps to achieve data flow limit
RS helps to eliminate WAR dependency if it copies
operands

65
Reservation Station and ROB

RS and ROB are critical components of out of
order execution
Issues associated with management of these
components determine efficiency of superscalar
execution
Loading and unloading entries of RS and ROB
should be managed well

66
Reservation Station Structure

3 tasks of RS - Dispatching, Waiting and Issuing
RS fields
Operand 1, Valid field for Operand 1
Operand 2, Valid field for Operand 2
Busy - for entire RS entry
Ready - indicates ready to be issued (all source
operands available)

67
Reservation Station Mechanisms
68
Reservation Station contd

Dispatching into RS - 3 steps
Select a free (i.e. not busy) RS entry
Load operands/tags into the selected entry
Set busy bit of that entry
Instructions with pending operands are not ready
Tag match occurs and instruction receives all
operands -> instruction wake up

69
Wake up and Select Logic

Wake up logic checks for tag match and sets ready
bit of instructions when all operands received
Associative operation involved because a tag on
the bus needs to be compared against all
instructions waiting in RS
Select logic selects an instruction to be issued
Scheduling heuristic
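
Wake up and select can be sketched as two small functions over a list of reservation station entries. The names `RSEntry`, `wake_up` and `select` are hypothetical, and the oldest-first heuristic is just one possible scheduling policy. Real hardware performs the tag comparison on all entries in parallel; the loop here only models the effect.

```python
class RSEntry:
    def __init__(self, name, age, src_tags):
        self.name = name
        self.age = age                  # smaller means older
        self.waiting = set(src_tags)    # tags of operands not yet received

    @property
    def ready(self):
        return not self.waiting


def wake_up(entries, broadcast_tag):
    """Compare the broadcast tag against every waiting entry."""
    for e in entries:
        e.waiting.discard(broadcast_tag)


def select(entries):
    """Oldest-first heuristic: issue the ready entry with the smallest age."""
    ready = [e for e in entries if e.ready]
    if not ready:
        return None
    chosen = min(ready, key=lambda e: e.age)
    entries.remove(chosen)   # the RS entry is freed when the instruction issues
    return chosen


rs = [RSEntry("X", 1, {"t1"}), RSEntry("Y", 2, {"t1"}), RSEntry("Z", 3, {"t2", "t3"})]
print(select(rs))            # None: nothing is ready yet
wake_up(rs, "t1")            # result tagged t1 appears on the bus
print(select(rs).name)       # X (oldest ready)
print(select(rs).name)       # Y
```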

70
ROB design issues

ROB contains all instructions in flight
Does RS contain all instructions in flight?
Each instruction can be waiting for execution, in
execution, or waiting for completion after execution
Status bits to indicate these
Bit to indicate whether instruction is in
speculative path
When branch resolved, speculative ->
nonspeculative
Only non-speculative instructions can be retired

71
ROB organization

ROB fields
ROB managed as circular queue
Head pointer and tail pointer
Tail pointer advanced when ROB entry allocated at
dispatch
Dispatch bandwidth - number of entries allocated
per cycle
Instructions completed from head of queue
Completion bandwidth
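
The circular-queue organization can be sketched as follows. Entries are allocated at the tail in program order and retired from the head, at most `bandwidth` per call. The class name and the entry fields are illustrative; a real ROB entry carries more state (speculative bit, rename register, and so on).

```python
class ReorderBuffer:
    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0      # oldest instruction in flight
        self.tail = 0      # next free slot
        self.count = 0

    def allocate(self, name):
        """Called at dispatch, in program order. Returns the ROB index."""
        if self.count == len(self.entries):
            raise RuntimeError("ROB full: dispatch must stall")
        idx = self.tail
        self.entries[idx] = {"name": name, "finished": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def mark_finished(self, idx):
        """Called when the instruction finishes execution (possibly out of order)."""
        self.entries[idx]["finished"] = True

    def complete(self, bandwidth):
        """Retire up to `bandwidth` finished instructions from the head, in order."""
        retired = []
        while (self.count and len(retired) < bandwidth
               and self.entries[self.head]["finished"]):
            retired.append(self.entries[self.head]["name"])
            self.entries[self.head] = None
            self.head = (self.head + 1) % len(self.entries)
            self.count -= 1
        return retired


rob = ReorderBuffer(4)
a, b, c = rob.allocate("A"), rob.allocate("B"), rob.allocate("C")
rob.mark_finished(c)          # C finishes first, out of order
print(rob.complete(2))        # []: the head (A) has not finished
rob.mark_finished(a)
print(rob.complete(2))        # ['A']
rob.mark_finished(b)
print(rob.complete(2))        # ['B', 'C']
```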

72
Reorder Buffer Entry and Org
73
ROB issues contd

Completion bandwidth determined by routing
network and ports available for register
writeback
Data copying from ROB/RRF to ARF
RS + ROB can be one structure called RUU
(Register Update Unit) or instruction window

74
Dynamic Instruction Scheduler

Dynamic scheduling involves instruction window
(RS + ROB), wake up and select logic
Instruction scheduler with data capture
Scheduler without data capture

75
Instruction Scheduling without and with data
capture

(a) with data capture (b) without data capture

76
Dynamic Instruction Scheduler

Instruction scheduler with data capture
RS copies operands or tags at dispatch
Scheduler without data capture
no copying of operands, only tags
Scheduler performs tag match to wake up ready
instructions
Operands obtained from RF just prior to execution
Many new processors do it this way

77
Other Register Data Flow techniques

Is data flow limit fundamental?
Value prediction
Lipasti, Wilkerson, Shen
Predict load values
Values loaded by many load instructions are quite
predictable
Value locality

78
Memory Data Flow techniques

Not all data can be in registers
Spill code from compilers leads to load/store
instructions
Dynamic scheduling (out-of-order execution) of load and
store instructions is important
Long latency of loads
A load that is a cache miss should not block
a later load that could proceed

79
Memory Accessing Instructions

Steps in memory instructions
Memory address generation
Address not in instruction
Address computed from reg + offset
Memory address translation
To support Virtual memory
Memory sharing and protection issues
Data memory accessing

80
Processing of Load/Store Instructions
81
Load/Store Pipes

3 stages: address generation, address translation,
memory access
Look at Fig 5-30 L/S Unit
First pipe stage - add register with offset
Second pipe stage - TLB access; on a TLB miss, page
table access, possibly even a page fault
Page fault typically handled as exception

82
Load/Store Pipes

3rd pipe stage
Load - access data memory
Cache miss possible
Store instruction can be considered as finished
at the end of second stage
Data in register or ROB moved to store buffer
Store buffer is a FIFO buffer
Store instruction can be architecturally complete
but not retired to memory
Only non-speculative stores are retired
When an exception occurs, stores older than the
excepting instruction are retired, the rest flushed

83
Ordering of memory accesses - L/S dependencies

RAW
WAR
RAR
WAW
These memory dependencies must be enforced for
program correctness

84
Ordering of memory access

Total ordering of loads and stores is safe but
not required
Total ordering is very conservative
Independent loads could be allowed to go ahead of
pending stores
A load might be stuck on a cache miss; later
loads could be allowed to proceed ahead of it
If a load reads an address that a pending store
has yet to write, the load cannot go ahead
Load could be serviced from store buffer if
addresses known

85
Memory aliasing

If a load and a store refer to the same memory
location, there is an aliasing or collision.
Consider
store 4, 100(3)
..
load 6, 200(2)
Is the load independent of the store?

86
Memory aliasing

store 4, 100(3)
..
load 6, 200(2)
Is the load independent of the store?
What if register 3 holds 300 and register 2 holds 200?
Cannot be sure of the independence until the addresses
are calculated
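
The aliasing question comes down to comparing effective addresses (base register plus offset). This sketch reads the slide's question as register 3 holding 300 and register 2 holding 200; that reading is an assumption, since the symbols are not legible in this transcript. The function name is hypothetical.

```python
def effective_address(regs, base_reg, offset):
    """Address generation: contents of the base register plus the offset."""
    return regs[base_reg] + offset


regs = {2: 200, 3: 300}                        # assumed register contents
store_addr = effective_address(regs, 3, 100)   # store 4, 100(3)
load_addr = effective_address(regs, 2, 200)    # load 6, 200(2)
print(store_addr, load_addr, store_addr == load_addr)  # 400 400 True
```

The two instructions use different base registers and different offsets, yet they touch the same location, so the dependence cannot be seen from the instruction text alone.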

87
DAXPY Example

From LINPACK benchmark

88
DAXPY Example

Can you reorder loads and stores here?

89
DAXPY dependencies

Dependencies inside an iteration
Dependencies between iterations
Load instructions from a future iteration could
go ahead of store instructions from current
iteration
Loads could be allowed to go OOO without too much
difficulty
Stores are usually not allowed to go OOO

90
Load Bypassing and Load Forwarding

Load bypassing - allow a load instruction to jump
ahead of preceding store instructions if
the load address does not alias with the
preceding stores, i.e. no memory dependence on a
preceding store
Load forwarding - if a trailing load aliases with a
preceding store, the load can receive its data
from the store via forwarding
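
Both techniques can be sketched as one lookup against the store buffer. The sketch assumes every older store already has its address computed, which is the easy case; the function name and the data layout are illustrative.

```python
def execute_load(load_addr, store_buffer, memory):
    """store_buffer: list of (addr, value) for older pending stores, oldest first.

    Assumption: all store addresses are already known.
    """
    # Search youngest-first so the load sees the most recent matching store.
    for addr, value in reversed(store_buffer):
        if addr == load_addr:
            return value, "forwarded"      # load forwarding
    # No alias with any pending store: the load may go ahead of them all.
    return memory.get(load_addr, 0), "bypassed"


memory = {400: 1, 500: 2}
store_buffer = [(400, 10), (400, 11)]      # two pending stores to address 400
print(execute_load(400, store_buffer, memory))  # (11, 'forwarded')
print(execute_load(500, store_buffer, memory))  # (2, 'bypassed')
```

If some store address were still unknown, the load would either have to wait or proceed speculatively, which is where memory dependence prediction comes in.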

91
Early Execution of Load Instructions

(a) Load bypassing (b) Load Forwarding

92
Mechanisms for Load/Store Processing
93
Illustration of Load Bypassing
94
Illustration of Load Forwarding
95
Fully Out-of-Order Issuing and Execution
96
Memory dependence prediction

Memory dependence checking
Store p: store 4, 100(3)
Load q: load 5, 200(2)
If addresses unknown, conservatively we just
assume that p = q
Other option - speculate on whether p = q, and
proceed. If the prediction later turns out wrong,
correct the misspeculation

97
Memory dependence prediction

Memory disambiguation
Disambiguate the addresses
Memory disambiguation techniques can make a big
difference in performance
When multiple issue is combined with memory
dependence issues, schemes can be very complex

98
Multiported memories

Superscalar means multiple instruction issue
hence multiple loads could be happening in same
cycle
High cache and memory bandwidth required to
support an aggressive processor
Multiple ports on caches help
Multiple load/store pipes required

99
Non blocking memories

When a cache miss occurs, should following hits be
serviced?
Will a cache freeze on a miss?
Blocking and non-blocking caches
Superscalar execution needs non-blocking caches,
otherwise performance is poor
Lw1 - cache miss
Lw2 - will hit in cache
Lw3 - will hit in cache

100
MSHRs

MSHR - Miss Status Holding Register
The hardware support needed for handling
non-blocking misses
A small number of MSHRs (on the order of 4) can allow
up to 4 outstanding misses
Needs a missed load queue
Missed load queue holds the missing load; when
data arrives, the load exits the queue and finishes
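
A toy model of a non-blocking cache with MSHRs. The class name, the block-granularity addresses, and the choice to merge a second miss to the same block into the existing MSHR are all assumptions for illustration; the slides only state that MSHRs track outstanding misses and that missed loads wait in a queue.

```python
class NonBlockingCache:
    def __init__(self, num_mshrs=4):
        self.lines = set()       # block addresses currently in the cache
        self.mshrs = {}          # block address -> loads waiting on that miss
        self.num_mshrs = num_mshrs

    def load(self, load_id, block):
        if block in self.lines:
            return "hit"                      # hits are serviced during a miss
        if block in self.mshrs:               # miss to this block already outstanding
            self.mshrs[block].append(load_id)
            return "miss (merged)"
        if len(self.mshrs) == self.num_mshrs:
            return "stall"                    # no free MSHR
        self.mshrs[block] = [load_id]         # acts as the missed load queue
        return "miss"

    def fill(self, block):
        """Data arrives from memory: install the block and release waiting loads."""
        self.lines.add(block)
        return self.mshrs.pop(block)


cache = NonBlockingCache(num_mshrs=2)
cache.lines.add(0x80)
print(cache.load("Lw1", 0x40))   # miss
print(cache.load("Lw2", 0x80))   # hit, even though Lw1 is still outstanding
print(cache.load("Lw3", 0x80))   # hit
print(cache.fill(0x40))          # ['Lw1']: the missed load can now finish
```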

101
Dual-ported and Nonblocking Data Cache

Dual ported and non-blocking

102
Prefetching

Hardware and Software Prefetching
Prefetching Cache
Anticipates future misses and triggers these
misses early in the hope of bringing in the data
before the actual load happens
Memory reference prediction table
Prefetch queue

103
Software Prefetching

Compiler inserts prefetching instructions to
trigger prefetching of data into cache very early
Actual load instruction will hit if prefetch
happens in time
Loop Unrolling
Load hoisting
Software pipelining
Explicit Software Prefetching

104
Prefetching Data Cache
