CENG 450 Computer Systems and Architecture Lecture 12

1
CENG 450 Computer Systems and Architecture
Lecture 12
  • Amirali Baniasadi
  • amirali_at_ece.uvic.ca

2
This Lecture
  • Branch Prediction
  • Multiple Issue

3
Branch Prediction
  • Predicting the outcome of a branch
  • Direction
  • Taken / Not Taken
  • Direction predictors
  • Target Address
  • PC + offset (Taken) / PC + 4 (Not Taken)
  • Target address predictors
  • Branch Target Buffer (BTB)
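
A minimal sketch of how a direction prediction and a BTB lookup might be combined into the next fetch address. It assumes 4-byte instructions and a direct-mapped table; the names `BranchTargetBuffer` and `next_fetch_pc` are hypothetical and not taken from the lecture.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: remembers the last taken target seen for a branch PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot holds (branch_pc, target) or None

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the low two PC bits carry no information.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:  # full-PC tag check
            return slot[1]
        return None

    def update(self, pc, target):
        self.table[self._index(pc)] = (pc, target)


def next_fetch_pc(pc, predict_taken, btb):
    """Predicted taken with a BTB hit gives the stored target; otherwise fall through to PC + 4."""
    target = btb.lookup(pc)
    if predict_taken and target is not None:
        return target
    return pc + 4
```

In this sketch a branch that is predicted taken but misses in the BTB still fetches PC + 4, because no target is known yet.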

4
Why do we need branch prediction?
  • Branch prediction
  • Increases the number of instructions available
    for the scheduler to issue. Increases
    instruction level parallelism (ILP)
  • Allows useful work to be completed while waiting
    for the branch to resolve

5
Branch Prediction Strategies
  • Static
  • Decided before runtime
  • Examples
  • Always-Not Taken
  • Always-Taken
  • Backwards Taken, Forward Not Taken (BTFNT)
  • Profile-driven prediction
  • Dynamic
  • Prediction decisions may change during the
    execution of the program
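
The static schemes listed above can be compared with a small sketch. The trace below is made up for illustration (a backward loop-closing branch that is taken 9 times out of 10, and a forward branch that is taken 1 time out of 10); the function names are hypothetical.

```python
def always_not_taken(branch_pc, target_pc):
    return False


def always_taken(branch_pc, target_pc):
    return True


def btfnt(branch_pc, target_pc):
    """Backwards Taken, Forward Not Taken: a target below the branch PC is predicted taken."""
    return target_pc < branch_pc


def accuracy(predictor, trace):
    """trace is a list of (branch_pc, target_pc, taken) tuples."""
    correct = sum(predictor(pc, target) == taken for pc, target, taken in trace)
    return correct / len(trace)


# Hypothetical trace: backward branch 0x120 -> 0x100, forward branch 0x110 -> 0x118.
trace = [(0x120, 0x100, i != 9) for i in range(10)]
trace += [(0x110, 0x118, i == 0) for i in range(10)]

for predictor in (always_not_taken, always_taken, btfnt):
    print(predictor.__name__, accuracy(predictor, trace))
```

On this particular trace the two "always" schemes each get half of the branches right, while BTFNT gets 18 of 20; real programs will differ.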

6
What happens when a branch is predicted?
  • On misprediction
  • No speculative state may commit
  • Squash instructions in the pipeline
  • Must not allow stores in the pipeline to occur
  • Cannot allow stores which would not have happened
    to commit
  • Even for good branch predictors more than half of
    the fetched instructions are squashed

7
A Generic Branch Predictor
[Figure: generic branch predictor. Labels: Fetch, Resolve, f(PC, x), Predicted Stream (PC, T or NT), Actual Stream (T or NT), Execution Order]
  • What is f(PC, x)?
  • x can be any relevant information; thus far x was empty

8
Bimodal Branch Predictors
  • Dynamically store information about the branch
    behaviour
  • Branches tend to behave in a fixed way
  • Branches tend to behave in the same way across
    program execution
  • Index a Pattern History Table using the branch
    address
  • 1 bit: branch behaves as it did last time
  • Saturating 2-bit counter: branch behaves as it
    usually does

9
Saturating-Counter Predictors
  • Consider strongly biased branch with infrequent
    outcome
  • TTTTTTTTNTTTTTTTTNTTTT
  • Last-outcome will mispredict twice per
    infrequent outcome encounter
  • TTTTTTTTNTTTTTTTTNTTTT
  • Idea: remember the most frequent case
  • Saturating counter: hysteresis
  • Often called a bimodal predictor
  • Captures Temporal Bias

10
Bimodal Prediction
  • Table of 2-bit saturating counters
  • Predict the most common direction
  • Advantages: simple, cheap, good accuracy
  • Bimodal will mispredict once per infrequent
    outcome encounter
  • TTTTTTTTNTTTTTTTTNTTTT
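
The difference between a last-outcome (1-bit) predictor and a 2-bit saturating counter on the stream above can be checked with a short sketch. The initial states (last outcome = taken, counter = 3, i.e. strongly taken) are assumptions chosen so that warm-up does not add extra mispredictions.

```python
def last_outcome_mispredicts(stream, state=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in stream:
        if state != taken:
            misses += 1
        state = taken
    return misses


def two_bit_mispredicts(stream, counter=3):
    """2-bit saturating counter: values 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in stream:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


stream = [c == "T" for c in "TTTTTTTTNTTTTTTTTNTTTT"]
print(last_outcome_mispredicts(stream))  # 4: two per N
print(two_bit_mispredicts(stream))       # 2: one per N
```

The single N only moves the counter from 3 to 2, so the following T is still predicted taken; that is the hysteresis the slides refer to.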

11
Correlating Predictors
  • From program perspective
  • Different Branches may be correlated
  • if (aa == 2) aa = 0
  • if (bb == 2) bb = 0
  • if (aa != bb) then
  • Can be viewed as a pattern detector
  • Instead of keeping aggregate history information
  • I.e., most frequent outcome
  • Keep exact history information
  • Pattern of n most recent outcomes
  • Example
  • BHR: n most recent branch outcomes
  • Use PC and BHR (xor?) to access prediction table

12
Pattern-based Prediction
  • Nested loops
  • for i = 0 to N
  • for j = 0 to 3
  • Branch outcome stream for the j-for branch
  • 11101110111011101110
  • Patterns
  • 111 -> 0
  • 110 -> 1
  • 101 -> 1
  • 011 -> 1
  • 100% accuracy
  • Learning time: 4 instances
  • Table index: (PC, 3-bit history)
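
A small sketch of the pattern table for the j-loop branch. It assumes each table entry simply stores the last outcome seen after a given 3-bit history and that an untrained entry predicts taken; the PC value `0x400` and the function name are hypothetical.

```python
def simulate_pattern_predictor(outcomes, pc=0x400, history_bits=3):
    """Index a table by (PC, last n outcomes) and predict what followed that history last time."""
    table = {}                      # (pc, history) -> outcome last seen after that history
    history = 0
    mask = (1 << history_bits) - 1
    hits = []
    for actual in outcomes:
        prediction = table.get((pc, history), 1)  # assumption: untrained entries predict taken
        hits.append(prediction == actual)
        table[(pc, history)] = actual
        history = ((history << 1) | actual) & mask
    return hits, table


outcomes = [int(c) for c in "11101110111011101110"]
hits, table = simulate_pattern_predictor(outcomes)
print(hits.count(False))                 # mispredictions over the whole stream
print(table[(0x400, 0b111)])             # learned: 111 -> 0
```

With these assumptions the only misprediction is the first 0 (history 111 not yet trained); every later outcome is predicted correctly.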

13
Two-level Branch Predictors
  • A branch outcome depends on the outcomes of
    previous branches
  • First level: Branch History Registers (BHR)
  • Global history / branch correlation: past
    executions of all branches
  • Self history / private history: past executions
    of the same branch
  • Second level: Pattern History Table (PHT)
  • Use first level information to index a table
  • Possibly XOR with the branch address
  • PHT: usually saturating 2-bit counters
  • Also private, shared or global

14
Gshare Predictor (McFarling)
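[Figure: gshare predictor. The PC and the global BHR are combined by a function f to index the branch history table, which supplies the prediction.]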
  • PC and BHR can be
  • concatenated
  • completely overlapped
  • partially overlapped
  • xored, etc.
  • How deep should the BHR be?
  • Really depends on the program
  • But a deeper BHR increases learning time
  • May increase quality of information

15
Hybrid Prediction
  • Combining branch predictors
  • Use two different branch predictors
  • Access both in parallel
  • A third table determines which prediction to use
  • Different branches benefit from different types
    of history
[Figure: two or more predictor components combined]
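
One possible shape for a hybrid predictor, sketched under several assumptions: the two components are a bimodal table and a per-branch 3-bit history predictor, the chooser is a table of 2-bit counters that is only trained when the components disagree, and all class names are hypothetical.

```python
def bump(counter, up):
    """Saturating 2-bit counter update."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class Bimodal:
    def __init__(self, bits=8):
        self.mask = (1 << bits) - 1
        self.table = [2] * (1 << bits)           # assumption: start weakly taken

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc >> 2) & self.mask
        self.table[i] = bump(self.table[i], taken)


class LocalHistory:
    """Per-branch history selects a 2-bit counter (dicts stand in for hardware tables)."""

    def __init__(self, history_bits=3):
        self.mask = (1 << history_bits) - 1
        self.history = {}                        # pc -> recent outcomes of that branch
        self.counters = {}                       # (pc, history) -> 2-bit counter

    def predict(self, pc):
        return self.counters.get((pc, self.history.get(pc, 0)), 2) >= 2

    def update(self, pc, taken):
        h = self.history.get(pc, 0)
        self.counters[(pc, h)] = bump(self.counters.get((pc, h), 2), taken)
        self.history[pc] = ((h << 1) | int(taken)) & self.mask


class Hybrid:
    def __init__(self, first, second, bits=8):
        self.first, self.second = first, second
        self.mask = (1 << bits) - 1
        self.chooser = [1] * (1 << bits)         # 0-1: trust first, 2-3: trust second

    def predict(self, pc):
        use_second = self.chooser[(pc >> 2) & self.mask] >= 2
        return self.second.predict(pc) if use_second else self.first.predict(pc)

    def update(self, pc, taken):
        first_ok = self.first.predict(pc) == taken
        second_ok = self.second.predict(pc) == taken
        if first_ok != second_ok:                # train the chooser only on disagreement
            i = (pc >> 2) & self.mask
            self.chooser[i] = bump(self.chooser[i], second_ok)
        self.first.update(pc, taken)
        self.second.update(pc, taken)


def run(predictor, pc, outcomes):
    hits = []
    for taken in outcomes:
        hits.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return hits


outcomes = [True, True, True, False] * 10
bimodal_hits = run(Bimodal(), 0x400, outcomes)
hybrid_hits = run(Hybrid(Bimodal(), LocalHistory()), 0x400, outcomes)
print(bimodal_hits.count(False), hybrid_hits.count(False))
```

On this loop pattern the bimodal component alone misses every loop exit, while the hybrid learns to trust the history-based component after a short warm-up.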

16
Issues Affecting Accurate Branch Prediction
  • Aliasing
  • More than one branch may use the same BHT/PHT
    entry
  • Constructive
  • Prediction that would have been incorrect,
    predicted correctly
  • Destructive
  • Prediction that would have been correct,
    predicted incorrectly
  • Neutral
  • No change in the accuracy
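
Destructive aliasing can be illustrated with a toy bimodal table. The two branch addresses and table sizes below are made up so that the branches collide in a 4-entry table but not in a 64-entry one; `mispredictions` is a hypothetical helper.

```python
def mispredictions(trace, table_entries):
    """Count mispredictions of a PC-indexed table of 2-bit counters (all start weakly taken)."""
    counters = [2] * table_entries
    misses = 0
    for pc, taken in trace:
        i = (pc >> 2) % table_entries
        if (counters[i] >= 2) != taken:
            misses += 1
        counters[i] = min(counters[i] + 1, 3) if taken else max(counters[i] - 1, 0)
    return misses


# Branch at 0x100 is always taken, branch at 0x110 is never taken; they alternate.
trace = [(0x100, True), (0x110, False)] * 20

print(mispredictions(trace, 4))    # both branches share entry 0
print(mispredictions(trace, 64))   # each branch has its own entry
```

When the two branches share a counter, the always-taken branch keeps pulling it back to "taken", so the never-taken branch is mispredicted every time; with separate entries it is mispredicted only once.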

17
More Issues
  • Training time
  • Need to see enough branches to uncover pattern
  • Need enough time to reach steady state
  • Wrong history
  • Incorrect type of history for the branch
  • Stale state
  • Predictor is updated after information is needed
  • Operating system context switches
  • More aliasing caused by branches in different
    programs

18
Performance Metrics
  • Misprediction rate
  • Mispredicted branches per executed branch
  • Unfortunately, the most commonly reported metric
  • Instructions per mispredicted branch
  • Gives a better idea of the program behaviour
  • Branches are not evenly spaced
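
The two metrics can be contrasted with a short calculation. The instruction, branch and misprediction counts below are invented purely for illustration.

```python
def misprediction_rate(mispredicted, branches):
    """Mispredicted branches per executed branch."""
    return mispredicted / branches


def instructions_per_misprediction(instructions, mispredicted):
    """Average number of instructions executed between two mispredicted branches."""
    return instructions / mispredicted


# Hypothetical programs: same misprediction rate, different branch density.
program_a = {"instructions": 1_000_000, "branches": 200_000, "mispredicted": 10_000}
program_b = {"instructions": 1_000_000, "branches": 50_000, "mispredicted": 2_500}

for name, p in (("A", program_a), ("B", program_b)):
    rate = misprediction_rate(p["mispredicted"], p["branches"])
    ipm = instructions_per_misprediction(p["instructions"], p["mispredicted"])
    print(name, rate, ipm)
```

Both made-up programs have a 5% misprediction rate, but program B runs four times as many instructions between mispredictions, which the rate alone does not show.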

19
Upper Limit to ILP: Ideal Machine
  • Amount of parallelism when there are no branch
    mis-predictions and we're limited only by data
    dependencies
  • FP: 75 - 150 IPC
  • Integer: 18 - 60 IPC
  • IPC: instructions that could theoretically be issued
    per cycle

20
Impact of Realistic Branch Prediction
  • Limiting the type of branch prediction
  • FP: 15 - 45 IPC
  • Integer: 6 - 12 IPC

21
Multiple Issue
  • Multiple Issue is the ability of the processor to
    start more than one instruction in a given cycle.
  • Superscalar processors
  • Very Long Instruction Word (VLIW) processors

22
1990s Superscalar Processors
  • Bottleneck: CPI >= 1
  • Limit on scalar performance (single instruction
    issue)
  • Hazards
  • Superpipelining? Diminishing returns (hazards +
    overhead)
  • How can we make the CPI = 0.5?
  • Multiple instructions in every pipeline stage
    (super-scalar)
    Cycle  1    2    3    4    5    6    7
    Inst0  IF   ID   EX   MEM  WB
    Inst1  IF   ID   EX   MEM  WB
    Inst2       IF   ID   EX   MEM  WB
    Inst3       IF   ID   EX   MEM  WB
    Inst4            IF   ID   EX   MEM  WB
    Inst5            IF   ID   EX   MEM  WB
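
The cycle count for the six-instruction example can be reproduced with a short sketch. It assumes an ideal 5-stage pipeline with no hazards or stalls; the function names are hypothetical.

```python
import math


def issue_cycle(index, width):
    """Cycle (1-based) in which instruction `index` (0-based) enters IF."""
    return index // width + 1


def ideal_cycles(num_instructions, width, stages=5):
    """Total cycles with no hazards: the last issue group starts in cycle
    ceil(n / width) and needs `stages` cycles to finish."""
    return stages + math.ceil(num_instructions / width) - 1


def ideal_cpi(num_instructions, width, stages=5):
    return ideal_cycles(num_instructions, width, stages) / num_instructions


print(ideal_cycles(6, width=2))     # 7 cycles for two instructions per stage
print(ideal_cycles(6, width=1))     # 10 cycles for a scalar pipeline
print(ideal_cpi(1000, width=2))     # approaches 0.5 as the pipeline fill cost is amortized
```

Any hazard that stalls an issue group moves real machines away from this ideal figure.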

23
Superscalar Vs. VLIW
  • Religious debate, similar to RISC vs. CISC
  • Wisconsin + Michigan (superscalar) vs. Illinois
    (VLIW)
  • Q. Who can schedule code better, hardware or
    software?

24
Hardware Scheduling
  • High branch prediction accuracy
  • Dynamic information on latencies (cache misses)
  • Dynamic information on memory dependences
  • Easy to speculate (and recover from
    mis-speculation)
  • Works for generic, non-loop, irregular code
  • Ex: databases, desktop applications, compilers
  • Limited reorder buffer size limits lookahead
  • High cost/complexity
  • Slow clock

25
Software Scheduling
  • Large scheduling scope (full program), large
    lookahead
  • Can handle very long latencies
  • Simple hardware with fast clock
  • Only works well for regular codes (scientific,
    FORTRAN)
  • Low branch prediction accuracy
  • Can improve by profiling
  • No information on latencies like cache misses
  • Can improve by profiling
  • Pain to speculate and recover from
    mis-speculation
  • Can improve with hardware support

26
Superscalar Processors
  • Pioneer: IBM (America => RIOS, RS/6000, Power-1)
  • Superscalar instruction combinations
  • 1 ALU or memory or branch + 1 FP (RS/6000)
  • Any 1 + 1 ALU (Pentium)
  • Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1
    branch (Pentium II)
  • Impact of superscalar
  • More opportunity for hazards (why?)
  • More performance loss due to hazards (why?)

27
Superscalar Processors
  • Issues varying number of instructions per clock
  • Scheduling: static (by the compiler) or
    dynamic (by the hardware)
  • Superscalar has a varying number of
    instructions/cycle (1 to 8), scheduled by
    compiler or by HW (Tomasulo).
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

28
Elements of Advanced Superscalars
  • High performance instruction fetching
  • Good dynamic branch and jump prediction
  • Multiple instructions per cycle, multiple
    branches per cycle?
  • Scheduling and hazard elimination
  • Dynamic scheduling
  • Not necessarily: Alpha 21064 and Pentium were
    statically scheduled
  • Register renaming to eliminate WAR and WAW
  • Parallel functional units, paths/buses/multiple
    register ports
  • High performance memory systems
  • Speculative execution
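
A minimal sketch of how register renaming removes WAR and WAW hazards while keeping true (RAW) dependences. It assumes an unlimited supply of physical registers and no free list; the instruction format `(dest, src1, src2)` and the function name are hypothetical.

```python
def rename(instructions, num_arch_regs=8):
    """Give every destination a fresh physical register; sources read the current mapping."""
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]   # look up sources before remapping dest
        mapping[dest] = f"p{next_phys}"         # assumption: physical registers never run out
        next_phys += 1
        renamed.append((mapping[dest], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),   # I1: r1 = r2 op r3
    ("r2", "r4", "r5"),   # I2: writes r2, which I1 reads (WAR)
    ("r1", "r6", "r7"),   # I3: writes r1, which I1 also writes (WAW)
    ("r4", "r1", "r2"),   # I4: reads r1 from I3 and r2 from I2 (RAW)
]
for instruction in rename(program):
    print(instruction)
```

After renaming, no two instructions write the same register and no instruction overwrites a register an earlier one still reads, so only the RAW dependences of I4 on I3 and I2 remain.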

29
SS + DS + Speculation
  • Superscalar + Dynamic scheduling + Speculation
  • Three great tastes that taste great together
  • CPI >= 1?
  • Overcome with superscalar
  • Superscalar increases hazards
  • Overcome with dynamic scheduling
  • RAW dependences still a problem?
  • Overcome with a large window
  • Branches a problem for filling large window?
  • Overcome with speculation

30
The Big Picture
[Figure: the big picture. Static program -> fetch and branch predict -> issue -> execution -> reorder and commit]

31
Readings
  • New paper on branch prediction online. READ.
  • Material will be used in the THIRD quiz