Title: Superscalar Techniques
1Chapter 5
2Three major issues
- Instruction flow
- Register data flow
- Memory data flow
- Major challenges
- Branch instruction processing
- ALU instructions
- Load/store instructions
3Instruction Flow
- Control dependences
- Control Flow Graph (CFG)
- Nodes - instructions
- Edges - control flow
- Branches
- Jumps
- Calls, Returns
4Program Control Flow
5Performance Degradation due to branches
- In a scalar processor, the penalty may be 3 cycles
- In an N-wide superscalar processor, the equivalent of 3N cycles
- Conditional branches: 2 issues
- Condition resolution
- Target address generation
6Branch Penalty
- Unconditional branches
- Only target needs to be resolved
- But the addressing mode of the branch matters
- PC relative
- Indirect branches
7Disruption of Sequential Control Flow by Branch
- 3-cycle penalty for conditional branches (D, Dis, and Ex stages)
- If 4-way superscalar, penalty equivalent to 4 × 3 = 12 cycles
- For conditional branches, the bottleneck could be condition resolution or target address calculation
8Branch Target Address Generation Penalties
- Target address calculation penalty varies for different types of branches
- PC relative: 1 cycle
- Register indirect: 2 cycles
- Register indirect with offset: 3 cycles
9Branch Condition Resolution Penalties
- Different for different types of branches
- If CC (condition code) used: 2 cycles
- If GP register to be compared: 3 cycles
10Static Branch Prediction
- No hardware to predict branch at run time
- Always Not Taken
- Backwards branch always taken
- Compiler assisted branch prediction
- Depending on opcode
- Calls
- Returns
- Loops
11Dynamic Branch Prediction
- Static branch prediction achieves 70-80% accuracy
- Dynamic branch predictors achieve 80-95% accuracy in branch direction prediction
- Problems with static prediction
- A branch which is taken during the first half of the program and not taken during the second half
12FSM Model
13Branch Prediction Techniques
- Branch target speculation
- Branch condition speculation
- Branch target speculation
- BTB (branch target buffer)
- Branch instruction address (BIA)
- Branch target address (BTA)
- BTB is like a cache
14Branch Target Speculation
15Branch Condition Speculation
- Hardware to predict Taken/Not-Taken
- Always predict Not Taken
- Suppose 50% of branches are taken (T)
- Is predicting taken then equivalent to predicting not taken (NT)?
- No, because taken branches will have some additional penalty to start fetching instructions from the new target
16Branch Condition Speculation
- Let us say 3 cycles to fill the pipe with instructions from the new target
- Always not taken: no stalls per correct prediction
- Always taken: 3 stalls per correct prediction
- Assume 4 cycles per wrong prediction
- Assuming 20% branches,
- CPI1 = 1 + .05 × 4
- CPI2 = 1 + .05 × 4 + .05 × 3
17History-based Branch Prediction
- 2-bit branch predictor
- If count is 2 or 3 predict taken
- If count is 0 or 1 predict not taken
- A 2-bit counter for each branch?
- An array of 2-bit counters
- How to access this array?
- How to index into it?
- Organized like a cache
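The 2-bit scheme above can be sketched as a small table of saturating counters. This is a minimal illustration, not any particular processor's design: the table size, the initial counter value, the indexing by low-order address bits, and the names `TwoBitPredictor` and `accuracy` are all assumptions made for the example.

```python
class TwoBitPredictor:
    """Array of 2-bit saturating counters, indexed by branch address bits."""

    def __init__(self, entries=1024, initial=1):
        self.entries = entries              # assumed table size
        self.table = [initial] * entries    # each counter holds 0..3

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Count 2 or 3 -> predict taken; 0 or 1 -> predict not taken.
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate at 0


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly for one branch."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)
```

For a loop branch that is taken nine times and then falls through, the counter mispredicts only the loop exit once warmed up, which is the usual argument for two bits over one.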
18History-based Branch Prediction
19Optimal 2-bit Branch Predictors
- Counter: a 2-bit saturating counter
20Two Aspects of Branch Prediction
- (a) Branch Speculation (b) Branch Validation/Recovery
21Branch Prediction in PowerPC 604
- Instead of a single unified BTB, a BTAC and a BHT
- BTAC: 64 entries, fully associative (FA)
- BHT: 512 entries, direct mapped (DM)
- Fetch address sent to both
- BTAC: 1 cycle; BHT: 2 cycles
22PowerPC 604 contd
- BTAC: 64 entries, FA; BHT: 512 entries, DM
- BTAC: 1 cycle; BHT: 2 cycles
- 4 entries in Branch Reservation Station
- Up to 4 speculative branch instructions
- 2-bit tag used to identify speculative instructions
- After a branch resolves, speculative instructions are made non-speculative or invalidated.
23Two-level Branch Prediction of Yeh and Patt
24Correlated Branch Predictor with Global BHSR
25Correlated Branch Predictor with Individual BHSRs
26gshare Correlated Branch Predictor
27Two-level predictors (contd)
- BHSR: G - global, P - individual
- PHT: g - global, p - individual, s - shared
- A - Adaptive
- 3 predictors with 97% prediction accuracy
- GAg: 1 BHSR (18 bits), 1 PHT of 2^18 × 2 bits
- PAg: 512 × 4-way SA BHSRs of 12 bits, 1 PHT of 2^12 × 2 bits
- PAs: 512 × 4-way BHSRs of 6 bits, 512 PHTs of 2^6 × 2 bits (PHT is DM)
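A GAg organization (one global BHSR selecting a counter in one global PHT) can be sketched as follows. This is an illustrative model only: the class name `GAgPredictor`, the helper `pht_bits`, the default history length, and the initial counter value are assumptions for the example.

```python
class GAgPredictor:
    """One global history register indexing one table of 2-bit counters."""

    def __init__(self, history_bits=4):
        self.k = history_bits
        self.bhsr = 0                           # global branch history shift register
        self.pht = [1] * (1 << history_bits)    # 2^k two-bit counters

    def predict(self):
        # The history pattern selects the counter; 2 or 3 means taken.
        return self.pht[self.bhsr] >= 2

    def update(self, taken):
        c = self.pht[self.bhsr]
        self.pht[self.bhsr] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the history, keeping only k bits.
        self.bhsr = ((self.bhsr << 1) | int(taken)) & ((1 << self.k) - 1)


def pht_bits(history_bits):
    """Storage for a PHT of 2^k entries of 2 bits each."""
    return (1 << history_bits) * 2
```

With an 18-bit BHSR, `pht_bits(18)` gives 524288 bits of PHT storage, i.e. 2^18 × 2. Unlike a single 2-bit counter, this structure can learn a strictly alternating taken/not-taken pattern, because each history pattern gets its own counter.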
28Two-level predictors (contd)
- 2-level predictors give 95% accuracy
- Traditional predictors give 90%
- Processors starting with PentiumPro and AMD Nx686
onwards use 2-level predictors
29Three major issues
- Instruction flow
- Register data flow
- Memory data flow
- Major challenges
- Branch instruction processing
- ALU instructions
- Load/store instructions
30Register Data Flow Techniques
- Efficient execution of ALU-type instructions in the execution core
- Real work
- View of memory and control flow as supporting
- Memory instructions provide data
- Control flow instructions provide the right instructions
- Concept of useful and overhead instructions
31ALU Instructions
- Ri <- Fn(Rj, Rk)
- Source registers Rj, Rk
- Destination register Ri
- Operation Fn
- If source operands in Rj or Rk are not available: true data dependency
- If destination register Ri is not available: anti or output dependence
32Register Reuse and False Dependencies
- Anti and output dependences are due to register reuse
- Called false dependencies
- Register reuse is also called register recycling
- Compiler performs code generation and register allocation
33Register Allocation
- Single assignment code: code in which each symbolic register is used to store one value and is written only once
- Practical ISAs have a limited number of registers
- Register coloring algorithm
- Register live-range
34Register Renaming
- Dynamically assign different names to the multiple definitions of an architected register
- (a) Before renaming: r4 <- r3 op r2; r6 <- r4 op r5; r4 <- r6 op r7
- (b) After renaming: r4 <- r3 op r2; r6 <- r4 op r5; r8 <- r6 op r7
- (op stands for an ALU operation)
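A minimal sketch of dynamic renaming for a three-instruction sequence like the one above, in which r4 is written twice. The function `rename`, the register counts, and the simple free list are assumptions for illustration; a real renamer also reclaims registers and handles an empty free list, which this sketch does not.

```python
def rename(instructions, num_arch=8, num_phys=16):
    """Rename (dest, src1, src2) triples of architected register numbers.

    Every destination gets a fresh physical register, so two writes to the
    same architected register no longer conflict (no WAR/WAW hazards).
    """
    map_table = {r: r for r in range(num_arch)}    # identity mapping at start
    free_list = list(range(num_arch, num_phys))    # unassigned physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are looked up before the destination mapping changes.
        p1, p2 = map_table[src1], map_table[src2]
        pd = free_list.pop(0)                      # new name for this definition
        map_table[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [(4, 3, 2), (6, 4, 5), (4, 6, 7)]        # r4 is defined twice
print(rename(program))
```

The second write to r4 lands in a different physical register than the first, while the read of r4 in the middle instruction still refers to the first definition, so only the true (RAW) dependences remain.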
35Register Renaming
- Single assignment is effectively done for instructions that are in flight
- Eliminate all false dependences (anti and output) between the instructions in flight
- Common techniques
- Rename Register File (RRF), Architected Register File (ARF), mapping table
- Reorder Buffer (ROB)
36Rename Register File (RRF)
37Register Renaming Tasks
- Source read
- For operand fetch; occurs during decode/dispatch stage
- Destination allocate
- During decode; subtasks are: set busy bit, assign tag, update map table
- Register update
- At the end of execution, update RRF first and then ARF
38Source Read: 3 possibilities
- Busy bit in ARF not set
- ARF contains the operand
- Busy bit in ARF set
- There is a pending write to the ARF register; the content of the ARF is stale; the map table is used to get the RRF tag to index into the RRF.
- Valid bit of RRF set: source operand is in RRF
- Valid bit of RRF not set: RRF has a pending update; the tag is forwarded to the reservation station instead of the source operand. The RS will get the data later by forwarding.
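The three source-read outcomes can be sketched as a single lookup function. The data layout here is an assumption for illustration: the map table is folded into each ARF entry as a `tag` field, and the names `ArfEntry`, `RrfEntry`, and `source_read` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArfEntry:
    data: int = 0
    busy: bool = False   # set while a write to this register is pending
    tag: int = -1        # RRF index of the pending write (map table entry)


@dataclass
class RrfEntry:
    data: int = 0
    valid: bool = False  # set once the producing instruction has executed


def source_read(arf, rrf, reg):
    """Return ("value", v) if the operand is available, else ("tag", t)."""
    entry = arf[reg]
    if not entry.busy:
        return ("value", entry.data)      # case 1: ARF holds the operand
    rename = rrf[entry.tag]
    if rename.valid:
        return ("value", rename.data)     # case 2: operand is in the RRF
    return ("tag", entry.tag)             # case 3: RS waits for this tag
```

In the third case the reservation station stores the tag and later captures the value when it appears on the forwarding bus with a matching tag.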
39Register Renaming Tasks
40Integrating RRF and ARF
- Although the discussion so far assumed a separate RRF and ARF, this is not required
- A single register file with number of entries equal to RRF + ARF is sufficient
- Pooled register file
- Each physical register can be flexibly assigned to be AR (architected register) or RR (rename register)
- With a pooled register file, there is no need to copy the result for the final update
41Floating-Point Unit (FPU) Register Renaming
- Example for Pooled Register File
42Pooled Register File in IBM RS/6000
- 40 physical registers, 32 architected or logical registers
- Mapping table contains 32 entries of 6 bits each
- 6 bits specify the physical register
- Rename pipe stage contains map table, two circular queues and control logic
- Map table must have 4 ports due to fused multiply-add, which needs 3 registers
43Pooled Register File in IBM RS/6000
- First queue: Free list (FL)
- Second queue: Pending target return queue (PTRQ)
- FL contains registers available for renaming
- PTRQ contains registers already in use for renaming
- Figure shows initial condition with PTRQ empty
44Pooled Register File in IBM RS/6000
- Map table contains the latest mapping of each logical register.
- When a new instruction needs the register as destination, the current entry of the map table is pushed into the PTRQ
- Subsequent instructions that need this register as source will receive the new physical register specifier as the source
45True Data Dependencies and the Data flow limit
- RAW dependencies cannot be eliminated by renaming
- Producer-consumer relationship between 2 instructions
- Imposes serialization between 2 dependent instructions
- Data dependence graph (DDG) used to represent such true dependences
46FFT Code Fragment
47Data Flow Graph of the Code Fragment
- Instructions are nodes
- Edges are dependences
- Edges can be marked with latencies
- Critical path of a DFG: longest dependence chain measured in terms of total cumulative latency
- Data flow limit
- 12 cycles for FFT example
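The data flow limit is the length of the longest latency-weighted path through the data flow graph. A small sketch follows; the four-node graph and its latencies are hypothetical and are not the FFT fragment, and the function name `data_flow_limit` is an assumption.

```python
from functools import lru_cache


def data_flow_limit(latency, deps):
    """Longest cumulative latency along any dependence chain.

    latency: {node: cycles}
    deps:    {node: [producer nodes it truly depends on]}
    """
    @lru_cache(maxsize=None)
    def finish(node):
        # A node can start only when all of its producers have finished.
        start = max((finish(p) for p in deps.get(node, ())), default=0)
        return start + latency[node]

    return max(finish(n) for n in latency)


# Hypothetical graph: b and c both consume a; d consumes b and c.
latency = {"a": 2, "b": 3, "c": 2, "d": 3}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(data_flow_limit(latency, deps))   # critical path a -> b -> d
```

No amount of issue width can finish this graph in fewer cycles than the returned value, because every instruction on the critical path must wait for its producer.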
49Tomasulo's Algorithm
- IBM 360/91 FP unit had dynamic scheduling
- Tomasulo 1967
- Several contemporary superscalar out of order
processors draw a lot of ideas from IBM 360/91
50Original Design of IBM 360 FPU
51Modified Design of IBM 360 FPU
52Use of Tag Fields
53Example Instruction sequence
- W: R4 <- R0 op R8
- X: R2 <- R0 op R4
- Y: R4 <- R4 op R8
- Z: R8 <- R4 op R2
54Tomasulo's Algorithm
- W: R4 <- R0 op R8; X: R2 <- R0 op R4; Y: R4 <- R4 op R8; Z: R8 <- R4 op R2
- R0 = 6.0
- @2: W finishes; result broadcast; not written in
55Tomasulo's Algorithm
- W: R4 <- R0 op R8; X: R2 <- R0 op R4; Y: R4 <- R4 op R8; Z: R8 <- R4 op R2
- R0 = 6.0
- @4: Y finishes, updates R4
- @5: X finishes, updates R2
- @6: Z starts
56Data Flow Graphs of Example Instruction sequence
- W: R4 <- R0 op R8; X: R2 <- R0 op R4; Y: R4 <- R4 op R8; Z: R8 <- R4 op R2
57Dynamic Execution Core
- Out-of-order execution core, also called dynamic execution core
- Tries to achieve data flow limit
- 3 steps in dynamic execution
- Instruction dispatching
- Instruction execution
- Instruction completion
58Instruction Dispatch Phase
- Rename destination registers
- Allocate reservation station and ROB entries
- Advance instructions from the dispatch buffer to the reservation stations (RS)
- ROB entries allocated in program order
- A rename register, an RS entry, and a ROB entry must all be available to dispatch an instruction
59Instruction Execution Phase
- Issue Ready Instructions
- Execute issued instructions
- Forward results
- RS is responsible for identifying ready instructions
- Ready means all source operands are available
- Waiting instructions continually monitor the buses for operands
60Micro-Dataflow Engine for dynamic Execution
61Instruction Execution Phase contd
- Monitor the buses for operands using tags
- Result buses come with results and tags
- When tag matches, result captured from result bus into RS entry
- Result is also going into register file
- When all operands available, instruction ready for issue
- If multiple instructions are ready, a scheduling algorithm (e.g. oldest first, most critical first) picks which to issue
62Instruction Execution Phase contd
- FUs have varying latencies
- Single cycle latency FU
- Multi-cycle (fixed) latency FU
- Multi-cycle variable latency
- When instruction finishes, destination tag and result broadcast on result bus
- Result bus also called forwarding bus
- Tag is the specifier of the rename register assigned for the destination of this instruction
63Instruction Execution Phase contd
- When instruction finishes, destination tag and result broadcast on result bus
- All dependent instructions waiting in the RS will trigger a tag match and will latch in the broadcast result
- This forwarding does not need writing the result into a register and reading it from there
- The destination tag is also used to update the RRF
64Instruction Execution Phase contd
- RS entry usually deallocated when instruction issued
- Another trailing instruction can now be dispatched into the RS
- RS saturation can cause instruction stalls
- RS helps to achieve data flow limit
- RS helps to eliminate WAR dependency if it copies operands
65Reservation Station and ROB
- RS and ROB are critical components of out-of-order execution
- Issues associated with management of these components determine efficiency of superscalar execution
- Loading and unloading entries of RS and ROB should be managed well
66Reservation Station Structure
- 3 tasks of RS: dispatching, waiting and issuing
- RS fields
- Operand 1, Valid field for Operand 1
- Operand 2, Valid field for Operand 2
- Busy for entire RS entry
- Ready, to indicate ready to be issued (all source operands available)
67Reservation Station Mechanisms
68Reservation Station contd
- Dispatching into RS: 3 steps
- Select a free (i.e. not busy) RS entry
- Load operands/tags into the selected entry
- Set busy bit of that entry
- Instructions with pending operands are not ready
- Tag match occurs and instruction receives all operands -> instruction wake up
69Wake up and Select Logic
- Wake up logic checks for tag match and sets ready bit of instructions when all operands received
- Associative operation involved because a tag on the bus needs to be compared against all instructions waiting in RS
- Select logic selects an instruction to be issued
- Scheduling heuristic
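Wake-up and select can be sketched in a few lines. This is a software model under stated assumptions: the names `RSEntry`, `wake_up`, and `select` are hypothetical, the select heuristic is oldest-first, and an entry is deallocated at issue.

```python
class RSEntry:
    def __init__(self, name, age, waiting_tags):
        self.name = name
        self.age = age                      # smaller value = older instruction
        self.waiting = set(waiting_tags)    # tags of operands still pending
        self.values = {}                    # operands captured from the bus

    @property
    def ready(self):
        return not self.waiting             # ready when nothing is pending


def wake_up(station, tag, value):
    """Broadcast (tag, value); every waiting entry compares against it."""
    for entry in station:                   # models the associative tag match
        if tag in entry.waiting:
            entry.waiting.discard(tag)
            entry.values[tag] = value       # data-capture style RS


def select(station):
    """Pick the oldest ready entry, remove it from the RS, and return it."""
    ready = [e for e in station if e.ready]
    if not ready:
        return None
    chosen = min(ready, key=lambda e: e.age)
    station.remove(chosen)                  # entry freed at issue
    return chosen
```

In hardware the loop in `wake_up` is a set of parallel comparators, one per operand field per entry, which is why wake-up cost grows with the size of the instruction window.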
70ROB design issues
- ROB contains all instructions in flight
- Does RS contain all instructions in flight?
- Each instruction can be waiting for execution, in execution, or waiting for completion after execution
- Status bits to indicate these
- Bit to indicate whether instruction is in speculative path
- When branch resolved, speculative -> nonspeculative
- Only nonspeculative instructions can be retired
71ROB organization
- ROB fields
- ROB managed as circular queue
- Head pointer and tail pointer
- Tail pointer advanced when ROB entry allocated at dispatch
- Dispatch bandwidth: number of entries allocated per cycle
- Instructions completed from head of queue
- Completion bandwidth
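The circular-queue management of the ROB can be sketched as below. The class `ReorderBuffer` and its method names are assumptions for illustration, and the speculative bit and result fields are omitted to keep the example short.

```python
class ReorderBuffer:
    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0      # oldest in-flight instruction
        self.tail = 0      # next free slot
        self.count = 0

    def allocate(self, name):
        """Dispatch: take the slot at the tail, in program order."""
        if self.count == len(self.entries):
            return None                       # ROB full: dispatch stalls
        idx = self.tail
        self.entries[idx] = {"name": name, "finished": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def finish(self, idx):
        """Execution done (possibly out of order)."""
        self.entries[idx]["finished"] = True

    def complete(self, width):
        """Retire up to `width` finished instructions from the head."""
        retired = []
        while (self.count and len(retired) < width
               and self.entries[self.head]["finished"]):
            retired.append(self.entries[self.head]["name"])
            self.entries[self.head] = None
            self.head = (self.head + 1) % len(self.entries)
            self.count -= 1
        return retired
```

Here `width` plays the role of the completion bandwidth: even if many instructions have finished, at most `width` leave per call, and none can leave ahead of an unfinished older instruction.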
72Reorder Buffer Entry and Org
73ROB issues contd
- Completion bandwidth determined by routing network and ports available for register writeback
- Data copying from ROB/RRF to ARF
- RS + ROB can be one structure called RUU (Register Update Unit) or instruction window
74Dynamic Instruction Scheduler
- Dynamic scheduling involves instruction window (RS + ROB), wake up and select logic
- Instruction scheduler with data capture
- Scheduler without data capture
75Instruction Scheduling with and without data capture
- (a) with data capture (b) without data capture
76Dynamic Instruction Scheduler
- Instruction scheduler with data capture
- RS copies operands or tags at dispatch
- Scheduler without data capture
- no copying of operands, only tags
- Scheduler performs tag match to wake up ready instructions
- Operands obtained from RF just prior to execution
- Many new processors do it this way
77Other Register Data Flow techniques
- Is data flow limit fundamental?
- Value prediction
- Lipasti, Wilkerson, Shen
- Predict load values
- Values loaded by many load instructions are quite predictable
- Value locality
78Memory Data Flow techniques
- Not all data can be in registers
- Spill code from compilers leads to load/store instructions
- Dynamic scheduling (out-of-order execution) of load and store instructions is important
- Long latency of loads
- A load that is a cache miss should not block a later load which could go
79Memory Accessing Instructions
- Steps in memory instructions
- Memory address generation
- Address not in instruction
- Address computed from reg + offset
- Memory address translation
- To support Virtual memory
- Memory sharing and protection issues
- Data memory accessing
80Processing of Load/Store Instructions
81Load/Store Pipes
- Address generation, address translation, memory access: 3 stages
- Look at Fig 5-30 L/S Unit
- First pipe stage: add register with offset
- Second pipe stage: TLB access; if TLB miss, page table access, even possible page fault
- Page fault typically handled as exception
82Load/Store Pipes
- 3rd pipe stage
- Load: access data memory
- Cache miss possible
- Store instruction can be considered as finished at the end of second stage
- Data in register or ROB moved to store buffer
- Store buffer is a FIFO buffer
- Store instruction can be architecturally complete but not retired to memory
- Only non-speculative stores are retired
- When exceptions occur, stores up to the exception are retired, the rest are flushed
83Ordering of memory accesses - L/S dependencies
- These memory dependencies must be enforced for
program correctness
84Ordering of memory access
- Total ordering of loads and stores is safe but not required
- Total ordering is very conservative
- Independent loads could be allowed to go ahead of pending stores
- A load might be stuck on a cache miss; other loads could be allowed to proceed past it
- If a load is from an address that is yet to be stored, the load cannot go forward
- Load could be serviced from store buffer if addresses known
85Memory aliasing
- If a load and store refer to the same memory location, there is an aliasing or collision.
- Consider
- store r4, 100(r3)
- ..
- load r6, 200(r2)
- Is the load independent of the store?
86Memory aliasing
- store r4, 100(r3)
- ..
- load r6, 200(r2)
- Is the load independent of the store?
- What if r3 = 300 and r2 = 200?
- Cannot be sure of the independence until the addresses are calculated
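The point can be checked with a few lines of arithmetic. This sketch assumes base-plus-offset addressing and that the two base registers hold 300 and 200; the function names and the dictionary standing in for the register file are hypothetical.

```python
def effective_address(regs, base, offset):
    """Base register contents plus the immediate offset."""
    return regs[base] + offset


def aliases(regs, store, load):
    """True if the store and the load touch the same address."""
    return effective_address(regs, *store) == effective_address(regs, *load)


store = ("r3", 100)   # store ..., 100(r3)
load = ("r2", 200)    # load  ..., 200(r2)

# Different base registers and different offsets, yet both resolve to 400.
print(aliases({"r3": 300, "r2": 200}, store, load))
```

The instruction text alone says nothing about independence; only the register values at run time decide whether the two accesses collide.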
87DAXPY Example
88DAXPY Example
- Can you reorder loads and stores here?
89DAXPY dependencies
- Dependencies inside an iteration
- Dependencies between iterations
- Load instructions from a future iteration could go ahead of store instructions from the current iteration
- Loads could be allowed to go OOO without too much difficulty
- Stores are usually never allowed to go OOO
90Load Bypassing and Load Forwarding
- Load bypassing: allow load instructions to jump ahead of preceding store instructions if the load address does not alias with the preceding stores, i.e. no memory dependencies with preceding stores
- Load forwarding: if a trailing load aliases with a preceding store, the load can receive its data from the store via forwarding
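Both mechanisms amount to checking a load against the older stores still waiting in the store buffer. The following is a minimal sketch under assumptions: the function `execute_load`, the list-of-tuples store buffer, and the conservative stall on an unknown store address are choices made for the example.

```python
def execute_load(addr, store_buffer, memory):
    """Decide how a load proceeds past older, not-yet-retired stores.

    store_buffer: list of (address, data), oldest first; address None
                  means the store's address is not yet computed.
    Returns ("forward", data), ("bypass", data) or ("stall", None).
    """
    # Scan from the youngest older store back to the oldest.
    for st_addr, st_data in reversed(store_buffer):
        if st_addr is None:
            return ("stall", None)       # cannot rule out aliasing yet
        if st_addr == addr:
            return ("forward", st_data)  # load forwarding from the store
    return ("bypass", memory[addr])      # no alias: load bypasses the stores
```

Scanning youngest-first matters: if several buffered stores write the same address, the load must receive the value of the most recent one in program order.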
91Early Execution of Load Instructions
- (a) Load bypassing (b) Load Forwarding
92Mechanisms for Load/Store Processing
93Illustration of Load Bypassing
94Illustration of Load Forwarding
95Fully Out-of-Order Issuing and Execution
96Memory dependence prediction
- Memory dependence checking
- Store p: store r4, 100(r3)
- Load q: load r5, 200(r2)
- If addresses are unknown, conservatively we just assume that q depends on p
- Other option: speculate on whether q depends on p, and proceed. If the prediction later turns out to be wrong, correct the misspeculation
97Memory dependence prediction
- Memory disambiguation
- Disambiguate the addresses
- Memory disambiguation techniques can make a big difference in performance
- When multiple issue is combined with memory dependence issues, schemes can be very complex
98Multiported memories
- Superscalar means multiple instruction issue, hence multiple loads could be happening in the same cycle
- High cache and memory bandwidth required to support an aggressive processor
- Multiple ports on caches help
- Multiple load/store pipes required
99Non blocking memories
- On a cache miss, should following hits be serviced?
- Will a cache freeze on a miss?
- Blocking and non-blocking caches
- Superscalar execution needs non-blocking caches, otherwise poor performance
- Lw1: cache miss
- Lw2: will hit in cache
- Lw3: will hit in cache
- MSHR: Miss Status Holding Register
- Is the hardware support needed for handling nonblocking misses
- On the order of 4 MSHRs can allow up to 4 outstanding misses
- Needs a missed load queue
- Missed load queue holds the missing load; when data arrives, the load exits the missed load queue and finishes
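A non-blocking cache with MSHRs can be modeled roughly as follows. This is a sketch under assumptions: the class `NonBlockingCache`, the merging of a second miss to the same block into an existing MSHR, and the representation of the missed load queue as a per-MSHR list are all illustrative choices, not a description of a specific design.

```python
class NonBlockingCache:
    def __init__(self, num_mshrs=4):
        self.lines = {}              # block address -> data
        self.mshrs = {}              # block address -> loads waiting on it
        self.num_mshrs = num_mshrs

    def load(self, name, block):
        if block in self.lines:
            return "hit"                        # serviced even with misses pending
        if block in self.mshrs:
            self.mshrs[block].append(name)      # assumption: merge with pending miss
            return "miss"
        if len(self.mshrs) == self.num_mshrs:
            return "stall"                      # no free MSHR: cache must block
        self.mshrs[block] = [name]              # allocate an MSHR for this miss
        return "miss"

    def fill(self, block, data):
        """Data arrives from memory; waiting loads leave the queue and finish."""
        self.lines[block] = data
        return self.mshrs.pop(block, [])
```

With one load missing and two later loads hitting, the hits return immediately while the miss stays parked in its MSHR until `fill` is called for its block.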
101Dual-ported and Nonblocking Data Cache
- Dual ported and non-blocking
- Hardware and Software Prefetching
- Prefetching Cache
- Anticipates future misses and triggers these misses early in the hope of bringing in the data before the actual load happens
- Memory reference prediction table
- Prefetch queue
103Software Prefetching
- Compiler inserts prefetching instructions to trigger prefetching of data into cache very early
- Actual load instruction will hit if prefetch happens in time
- Loop unrolling
- Load hoisting
- Software pipelining
- Explicit Software Prefetching
104Prefetching Data Cache