CENG 450 Computer Systems and Architecture Lecture 12


Provided by: shin161


1
CENG 450 Computer Systems and Architecture
Lecture 12

Amirali Baniasadi
amirali_at_ece.uvic.ca

2
This Lecture

Branch Prediction
Multiple Issue

3
Branch Prediction

Predicting the outcome of a branch
Direction
Taken / Not Taken
Direction predictors
Target Address
PC + offset (Taken) / PC + 4 (Not Taken)
Target address predictors
Branch Target Buffer (BTB)
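
A minimal sketch of how the direction and target predictions combine into the next fetch address. The function name, the dictionary standing in for the BTB, and the 4-byte instruction size are assumptions made for illustration only.

```python
def predict_next_pc(pc: int, predicted_taken: bool, btb: dict) -> int:
    """Return the predicted fetch address for the instruction after `pc`."""
    if predicted_taken and pc in btb:
        # BTB hit: reuse the target (PC + offset) recorded earlier.
        return btb[pc]
    # Not taken, or no target known yet: fall through to PC + 4.
    return pc + 4


# Hypothetical BTB contents: the branch at 0x100 last jumped to 0x80.
btb = {0x100: 0x80}
print(hex(predict_next_pc(0x100, True, btb)))   # 0x80
print(hex(predict_next_pc(0x100, False, btb)))  # 0x104
print(hex(predict_next_pc(0x200, True, btb)))   # 0x204 (BTB miss)
```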

4
Why do we need branch prediction?

Branch prediction
Increases the number of instructions available
for the scheduler to issue. Increases
instruction level parallelism (ILP)
Allows useful work to be completed while waiting
for the branch to resolve

5
Branch Prediction Strategies

Static
Decided before runtime
Examples
Always-Not Taken
Always-Taken
Backwards Taken, Forward Not Taken (BTFNT)
Profile-driven prediction
Dynamic
Prediction decisions may change during the
execution of the program
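
As an illustration of a static scheme, BTFNT can be sketched as a pure function of the branch address and its target. The function name and the addresses below are hypothetical.

```python
def btfnt_predict(branch_pc: int, target_pc: int) -> bool:
    """Static BTFNT: predict taken only if the branch jumps backwards."""
    return target_pc < branch_pc


# A loop-closing branch jumps backwards, so it is predicted taken.
print(btfnt_predict(branch_pc=0x400120, target_pc=0x400100))  # True
# A forward branch (e.g. skipping an if-body) is predicted not taken.
print(btfnt_predict(branch_pc=0x400120, target_pc=0x400140))  # False
```

The decision never changes at run time, which is what makes the scheme static.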

6
What happens when a branch is predicted?

On misprediction
No speculative state may commit
Squash instructions in the pipeline
Must not allow stores in the pipeline to occur
Cannot allow stores which would not have happened
to commit
Even for good branch predictors more than half of
the fetched instructions are squashed

7
A Generic Branch Predictor
[Figure: Fetch produces the predicted stream (PC, T or NT) using f(PC, x). Resolve produces the actual stream (T or NT) in execution order.]
What is f(PC, x)?
x can be any relevant information. Thus far, x was empty.

8
Bimodal Branch Predictors

Dynamically store information about the branch
behaviour
Branches tend to behave in a fixed way
Branches tend to behave in the same way across
program execution
Index a Pattern History Table using the branch
address
1 bit: branch behaves as it did last time
Saturating 2-bit counter: branch behaves as it
usually does

9
Saturating-Counter Predictors

Consider strongly biased branch with infrequent
outcome
TTTTTTTTNTTTTTTTTNTTTT
Last-outcome will mispredict twice per
infrequent outcome encounter
TTTTTTTTNTTTTTTTTNTTTT
Idea: Remember the most frequent case
Saturating-Counter = Hysteresis
Often called a bimodal predictor
Captures Temporal Bias

10
Bimodal Prediction

Table of 2-bit saturating counters
Predict the most common direction
Advantages: simple, cheap, good accuracy
Bimodal will mispredict once per infrequent
outcome encounter
TTTTTTTTNTTTTTTTTNTTTT
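
The difference between last-outcome and 2-bit saturating-counter prediction can be checked on the stream above. This is a minimal sketch: the function names are hypothetical, and both predictors are assumed to start in a "taken" state.

```python
def last_outcome_mispredictions(stream: str, initial: str = "T") -> int:
    """1-bit scheme: predict whatever the branch did last time."""
    prediction, misses = initial, 0
    for outcome in stream:
        if prediction != outcome:
            misses += 1
        prediction = outcome
    return misses


def two_bit_mispredictions(stream: str, counter: int = 3) -> int:
    """2-bit saturating counter: values 2 and 3 predict taken."""
    misses = 0
    for outcome in stream:
        taken = outcome == "T"
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


stream = "T" * 8 + "N" + "T" * 8 + "N" + "T" * 4
print(last_outcome_mispredictions(stream))  # 4 (two per N)
print(two_bit_mispredictions(stream))       # 2 (one per N)
```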

11
Correlating Predictors

From program perspective
Different Branches may be correlated
if (aa == 2) aa = 0
if (bb == 2) bb = 0
if (aa != bb) then
Can be viewed as a pattern detector
Instead of keeping aggregate history information
I.e., most frequent outcome
Keep exact history information
Pattern of n most recent outcomes
Example
BHR: n most recent branch outcomes
Use PC and BHR (xor?) to access prediction table

12
Pattern-based Prediction

Nested loops
for i = 0 to N
  for j = 0 to 3
Branch outcome stream for the j-for branch:
11101110111011101110
Patterns:
111 -> 0
110 -> 1
101 -> 1
011 -> 1
100% accuracy
Learning time: 4 instances
Table index: (PC, 3-bit history)
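
A small simulation of this idea, assuming a single branch (so the PC is left out of the index), a 3-bit history register, and 2-bit counters that start weakly taken. The function name and starting values are illustrative choices, not taken from the slides.

```python
def pattern_predictor_misses(stream: str, history_bits: int = 3) -> int:
    """Count mispredictions of a history-indexed table of 2-bit counters."""
    mask = (1 << history_bits) - 1
    counters = [2] * (1 << history_bits)  # assumed start: weakly taken
    history, misses = 0, 0
    for ch in stream:
        taken = ch == "1"
        if (counters[history] >= 2) != taken:
            misses += 1
        c = counters[history]
        counters[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the newest outcome into the history register.
        history = ((history << 1) | int(taken)) & mask
    return misses


stream = "1110" * 10
print(pattern_predictor_misses(stream))  # 1
```

Under these assumptions the only misprediction is the first loop exit. After that, the entry for history 111 has learned to predict not taken.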

13
Two-level Branch Predictors

A branch outcome depends on the outcomes of
previous branches
First level: Branch History Registers (BHR)
Global history / branch correlation: past
executions of all branches
Self history / private history: past executions
of the same branch
Second level: Pattern History Table (PHT)
Use first level information to index a table
Possibly XOR with the branch address
PHT: usually saturating 2-bit counters
Also private, shared or global

14
Gshare Predictor (McFarling)
[Figure: the PC and the global BHR are combined by a function f to index the Branch History Table, which yields the prediction.]

PC and BHR can be
concatenated
completely overlapped
partially overlapped
xored, etc.
How deep should the BHR be?
Really depends on program
But, deeper increases learning time
May increase quality of information
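
A minimal gshare sketch using the XOR option. The class name, the 10-bit table size, the weakly-taken initial counters, and the 4-byte instruction alignment are all assumptions for illustration.

```python
class Gshare:
    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.counters = [2] * (1 << index_bits)  # 2-bit counters
        self.ghr = 0  # global branch history register

    def _index(self, pc: int) -> int:
        # Drop the low 2 bits (assumes 4-byte instructions), XOR with history.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


# Hypothetical branch that alternates T, N, T, N, ...
predictor = Gshare()
late_misses = 0
for i in range(100):
    taken = i % 2 == 0
    if i >= 50 and predictor.predict(0x1000) != taken:
        late_misses += 1
    predictor.update(0x1000, taken)
print(late_misses)  # 0
```

The early mispredictions are excluded from the count because the history register must fill before the table indices repeat, which is the learning-time cost of a deeper BHR.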

15
Hybrid Prediction

Combining branch predictors
Use two different branch predictors
Access both in parallel
A third table determines which prediction to use
Two or more predictor components combined
Different branches benefit from different types of history
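
One way to sketch such a combination: a PC-indexed counter table and a history-indexed counter table are read in parallel, and a third table of 2-bit counters chooses between them. The class name, table sizes, and update policy (train the chooser only when the components disagree) are assumptions, not details from the slides.

```python
def bump(counter: int, up: bool) -> int:
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class Hybrid:
    def __init__(self, bits: int = 10):
        size = 1 << bits
        self.mask = size - 1
        self.bimodal = [2] * size  # indexed by PC only
        self.pattern = [2] * size  # indexed by PC xor global history
        self.chooser = [2] * size  # >= 2 means "trust the pattern table"
        self.ghr = 0

    def _lookups(self, pc: int):
        i = (pc >> 2) & self.mask
        j = ((pc >> 2) ^ self.ghr) & self.mask
        return i, j, self.bimodal[i] >= 2, self.pattern[j] >= 2

    def predict(self, pc: int) -> bool:
        i, _, p_bimodal, p_pattern = self._lookups(pc)
        return p_pattern if self.chooser[i] >= 2 else p_bimodal

    def update(self, pc: int, taken: bool) -> None:
        i, j, p_bimodal, p_pattern = self._lookups(pc)
        # Train the chooser only when the two components disagree.
        if p_bimodal != p_pattern:
            self.chooser[i] = bump(self.chooser[i], p_pattern == taken)
        self.bimodal[i] = bump(self.bimodal[i], taken)
        self.pattern[j] = bump(self.pattern[j], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


# Hypothetical alternating branch: the PC-indexed table alone cannot learn it.
hybrid = Hybrid()
late_misses = 0
for i in range(100):
    taken = i % 2 == 0
    if i >= 50 and hybrid.predict(0x1000) != taken:
        late_misses += 1
    hybrid.update(0x1000, taken)
print(late_misses)  # 0
```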

16
Issues Affecting Accurate Branch Prediction

Aliasing
More than one branch may use the same BHT/PHT
entry
Constructive
Prediction that would have been incorrect,
predicted correctly
Destructive
Prediction that would have been correct,
predicted incorrectly
Neutral
No change in the accuracy
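
Destructive aliasing can be reproduced with two hypothetical branches of opposite bias. With a 16-entry table both addresses map to the same counter, while a 64-entry table separates them. The addresses, table sizes, and function name are illustrative assumptions.

```python
def interleaved_misses(table_bits: int, rounds: int = 10) -> int:
    """Count mispredictions for two interleaved branches in one table."""
    mask = (1 << table_bits) - 1
    counters = [2] * (1 << table_bits)  # 2-bit counters, weakly taken
    # Hypothetical branches: A is always taken, B is never taken.
    branches = [(0x1000, True), (0x1040, False)]
    misses = 0
    for _ in range(rounds):
        for pc, taken in branches:
            i = (pc >> 2) & mask
            if (counters[i] >= 2) != taken:
                misses += 1
            c = counters[i]
            counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
    return misses


print(interleaved_misses(table_bits=4))  # 10: shared entry, B always wrong
print(interleaved_misses(table_bits=6))  # 1: separate entries
```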

17
More Issues

Training time
Need to see enough branches to uncover pattern
Need enough time to reach steady state
Wrong history
Incorrect type of history for the branch
Stale state
Predictor is updated after information is needed
Operating system context switches
More aliasing caused by branches in different
programs

18
Performance Metrics

Misprediction rate
Mispredicted branches per executed branch
Unfortunately, the most commonly reported metric
Instructions per mispredicted branch
Gives a better idea of the program behaviour
Branches are not evenly spaced
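
The two metrics can disagree. In the sketch below, with made-up counts for two hypothetical programs, both have the same misprediction rate, yet one is interrupted by a misprediction four times as often.

```python
def misprediction_rate(mispredicted: int, branches: int) -> float:
    return mispredicted / branches


def instructions_per_misprediction(instructions: int, mispredicted: int) -> float:
    return instructions / mispredicted


# Hypothetical counts, chosen only to illustrate the point.
a = dict(instructions=1_000_000, branches=200_000, mispredicted=10_000)
b = dict(instructions=1_000_000, branches=50_000, mispredicted=2_500)

for name, p in (("A", a), ("B", b)):
    rate = misprediction_rate(p["mispredicted"], p["branches"])
    ipm = instructions_per_misprediction(p["instructions"], p["mispredicted"])
    print(name, rate, ipm)
# A 0.05 100.0
# B 0.05 400.0
```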

19
Upper Limit to ILP: Ideal Machine

Amount of parallelism when there are no branch
mis-predictions and we're limited only by data
dependencies.
FP: 75 - 150 IPC
Integer: 18 - 60 IPC
IPC: instructions that could theoretically be issued
per cycle.

20
Impact of Realistic Branch Prediction

Limiting the type of branch prediction.

FP: 15 - 45 IPC
Integer: 6 - 12 IPC

21
Multiple Issue

Multiple Issue is the ability of the processor to
start more than one instruction in a given cycle.
Superscalar processors
Very Long Instruction Word (VLIW) processors

22
1990s Superscalar Processors

Bottleneck: CPI >= 1
Limit on scalar performance (single instruction
issue)
Hazards
Superpipelining? Diminishing returns (hazards +
overhead)
How can we make the CPI = 0.5?
Multiple instructions in every pipeline stage
(super-scalar)
        1    2    3    4    5    6    7
Inst0   IF   ID   EX   MEM  WB
Inst1   IF   ID   EX   MEM  WB
Inst2        IF   ID   EX   MEM  WB
Inst3        IF   ID   EX   MEM  WB
Inst4             IF   ID   EX   MEM  WB
Inst5             IF   ID   EX   MEM  WB
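
The cycle count in this diagram follows from a simple formula for an ideal, hazard-free pipeline: the first issue group needs one cycle per stage, and each later group finishes one cycle after the previous one. The function name is hypothetical and the model ignores all stalls.

```python
import math


def ideal_cycles(instructions: int, issue_width: int, stages: int = 5) -> int:
    """Cycles to finish `instructions` with no hazards or stalls."""
    groups = math.ceil(instructions / issue_width)
    return stages + groups - 1


print(ideal_cycles(6, issue_width=2))  # 7, as in the diagram
print(ideal_cycles(6, issue_width=1))  # 10 for a scalar pipeline

# For long instruction streams the CPI approaches 1 / issue_width.
n = 1_000_000
print(ideal_cycles(n, issue_width=2) / n)  # close to 0.5
```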

23
Superscalar Vs. VLIW

Religious debate, similar to RISC vs. CISC
Wisconsin + Michigan (superscalar) vs. Illinois
(VLIW)
Q. Who can schedule code better, hardware or
software?

24
Hardware Scheduling

High branch prediction accuracy
Dynamic information on latencies (cache misses)
Dynamic information on memory dependences
Easy to speculate (and recover from
mis-speculation)
Works for generic, non-loop, irregular code
Ex: databases, desktop applications, compilers
Limited reorder buffer size limits lookahead
High cost/complexity
Slow clock

25
Software Scheduling

Large scheduling scope (full program), large
lookahead
Can handle very long latencies
Simple hardware with fast clock
Only works well for regular codes (scientific,
FORTRAN)
Low branch prediction accuracy
Can improve by profiling
No information on latencies like cache misses
Can improve by profiling
Pain to speculate and recover from
mis-speculation
Can improve with hardware support

26
Superscalar Processors

Pioneer: IBM (America => RIOS, RS/6000, Power-1)
Superscalar instruction combinations
1 ALU or memory or branch + 1 FP (RS/6000)
Any 1 + 1 ALU (Pentium)
Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1
branch (Pentium II)
Impact of superscalar
More opportunity for hazards (why?)
More performance loss due to hazards (why?)

27
Superscalar Processors

Issues a varying number of instructions per clock
Scheduling: static (by the compiler) or
dynamic (by the hardware)
Superscalar has a varying number of
instructions/cycle (1 to 8), scheduled by the
compiler or by HW (Tomasulo).
IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

28
Elements of Advanced Superscalars