Modern Computer Architectures Lecture 4: Branch Prediction, Multiple-Issue Processors, Limits to ILP

1
Modern Computer Architectures, Lecture 4: Branch
Prediction, Multiple-Issue Processors, Limits to ILP
  • Dr. Ben Juurlink
  • Fall 2001

2
Branch PredictionMotivation
  • Scoreboard and Tomasulo stop issuing instructions
    when a branch is encountered
  • With on average one out of five instructions
    being a branch, the maximum ILP is five
  • Situation even worse for multiple-issue
    processors, because we need to provide an
    instruction stream of n instructions per cycle.
  • Idea: predict the outcome of branches based on
    their history and execute instructions
    speculatively

3
5 Branch Prediction Schemes
  • 1-bit Branch Prediction Buffer
  • 2-bit Branch Prediction Buffer
  • Correlating Branch Prediction Buffer
  • Branch Target Buffer
  • Return Address Predictors
  • ... a way to get rid of those malicious branches

4
1-bit Branch Prediction Buffer
  • 1-bit branch prediction buffer or branch history
    table
  • Buffer is like a cache without tags
  • Does not help for the simple DLX pipeline
    because the target address calculation is in the
    same stage as the branch condition calculation

[Figure: low-order bits of the PC index the BHT, a
tagless array of 1-bit taken/not-taken predictions]
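The scheme above can be sketched in a few lines of Python (an illustration, not from the slides): a tagless table indexed by the low-order PC bits, with one prediction bit per entry.

```python
class OneBitBHT:
    """1-bit branch history table: a cache-like array without tags."""

    def __init__(self, entries=4096):
        self.table = [0] * entries   # 0 = predict not taken, 1 = predict taken
        self.mask = entries - 1      # entries must be a power of two

    def predict(self, pc):
        # Low-order PC bits index the table; aliasing between branches is possible.
        return self.table[pc & self.mask] == 1

    def update(self, pc, taken):
        self.table[pc & self.mask] = 1 if taken else 0
```

A branch taken 15 times and then not taken once mispredicts twice per pass through the loop: once at the exit and once again on re-entry.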
5
2-bit Branch Prediction Buffer
  •   sad = 0;
  •   for (i = 0; i < 16; i++)
  •     for (j = 0; j < 16; j++) {
  •       if ((v = a[i][j] - b[i][j]) < 0) v = -v;
  •       sad += v;
  •     }
  • Problem: in a nested loop, a 1-bit BHT will
    cause 2 mispredictions per inner-loop execution
  • End of loop case, when it exits instead of
    looping as before
  • First time through the loop on the next
    execution of the code, when it predicts exit
    instead of looping
  • Only 88% accuracy even if the loop branch is
    taken 94% of the time

6
2-bit Branch Prediction Buffer
  • Solution: 2-bit scheme where the prediction is
    changed only if mispredicted twice
  • Can be implemented as a saturating counter

[Figure: 2-bit saturating-counter state diagram with
two Predict Taken states and two Predict Not Taken
states; T/NT transitions move between them, so two
consecutive mispredictions are needed to flip the
prediction]
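A minimal Python sketch of this state machine (illustrative, not from the slides): one saturating counter per entry, with values 0-1 predicting not taken and 2-3 predicting taken.

```python
class TwoBitBHT:
    """2-bit branch prediction buffer: one saturating counter per entry."""

    def __init__(self, entries=4096):
        self.table = [0] * entries   # counter 0..3; start strongly not taken
        self.mask = entries - 1

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2   # 2 or 3 = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate at 0
```

Once warmed up, the 16-iteration loop branch mispredicts only once per inner-loop execution (at the exit) instead of twice.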
7
Correlating Branches
  • Fragment from SPEC92 benchmark eqntott (aa in
    R1, bb in R2):
  •   if (aa == 2)            subi R3,R1,2
  •     aa = 0;         b1:   bnez R3,L1
  •   if (bb == 2)            add  R1,R0,R0
  •     bb = 0;         L1:   subi R3,R2,2
  •   if (aa != bb)     b2:   bnez R3,L2
  •                           add  R2,R0,R0
  •                     L2:   sub  R3,R1,R2
  •                     b3:   beqz R3,L3

8
Correlating Branch Predictor
  • Idea behavior of this branch is related to
    taken/not taken history of recently executed
    branches
  • Then behavior of recent branches selects between,
    say, 4 predictions of next branch, updating just
    that prediction
  • (2,2) predictor: 2-bit global, 2-bit local
  • (m,n) predictor: uses the behavior of the last m
    branches to choose from 2^m predictors, each of
    which is an n-bit predictor

[Figure: 4 bits from the branch address select a row
of 2-bit per-branch local predictors; a 2-bit global
branch history shift register (01 = not taken then
taken) selects the column, giving the prediction]
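As an illustration (details assumed, not from the slides), an (m, n) predictor with n = 2 can be sketched as a global shift register selecting one of 2^m 2-bit counters per table entry:

```python
class CorrelatingPredictor:
    """(m, 2) correlating predictor: m bits of global history select
    one of 2**m 2-bit saturating counters per branch-table entry."""

    def __init__(self, m=2, entries=1024):
        self.m = m
        self.history = 0                     # global taken/not-taken shift register
        self.table = [[0] * (1 << m) for _ in range(entries)]
        self.mask = entries - 1

    def predict(self, pc):
        return self.table[pc & self.mask][self.history] >= 2

    def update(self, pc, taken):
        ctrs = self.table[pc & self.mask]
        h = self.history
        if taken:
            ctrs[h] = min(3, ctrs[h] + 1)
        else:
            ctrs[h] = max(0, ctrs[h] - 1)
        # Shift the outcome into the global history register.
        self.history = ((h << 1) | int(taken)) & ((1 << self.m) - 1)
```

On a branch that strictly alternates taken/not-taken (which defeats a plain 2-bit BHT), this predictor becomes perfect once the two history-selected counters are trained.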
9
Accuracy of Different Branch Predictors
[Chart: misprediction rates from 0% to 18% for a
4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT,
and a 1024-entry (2,2) BHT]
10
BHT Accuracy
  • Mispredict because either
  • Wrong guess for that branch
  • Got branch history of the wrong branch when
    indexing the table
  • 4096-entry table: misprediction rates vary from
    1% (nasa7, tomcatv) to 18% (eqntott), with spice
    at 9% and gcc at 12%
  • For SPEC92, 4096 entries are about as good as an
    infinite table
  • Real programs + OS behave more like gcc

11
Branch Target Buffer
  • For DLX pipeline, need target address at same
    time as prediction
  • Branch Target Buffer (BTB): the address of the
    branch indexes the table to get the prediction
    AND the branch target address (if taken)
  • Note: must now check that the entry matches this
    branch, since any instruction can index the
    table

[Figure: the PC indexes the BTB and is compared
against the stored branch address. Match: the
instruction is a branch; use the predicted PC as the
next PC if the branch is predicted taken. No match:
the instruction is not a branch; proceed normally]
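A toy model of this structure (illustrative only): because any instruction's PC probes the table, each entry must be tagged with the branch's address, modeled here with a Python dict.

```python
class BranchTargetBuffer:
    """Maps the PC of a predicted-taken branch to its target address."""

    def __init__(self):
        self.entries = {}                 # branch PC -> predicted target PC

    def lookup(self, pc):
        # Hit => instruction is a branch predicted taken: use the stored
        # target as the next PC. Miss => fetch proceeds sequentially.
        return self.entries.get(pc)

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target     # allocate or refresh the entry
        else:
            self.entries.pop(pc, None)    # one common policy: evict not-taken branches
```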
12
Instruction Fetch Stage
  • Not shown hardware needed when prediction was
    wrong

[Figure: instruction fetch datapath with BTB lookup;
when the branch is found and predicted taken, the
stored target address becomes the next PC]
13
Special Case Return Addresses
  • Register indirect branches hard to predict
    target address
  • MIPS/DLX instruction jr r31: PC = r31
  • useful for
  • implementing switch/case statements
  • FORTRAN computed GOTOs
  • procedure return (mainly)
  • SPEC89: 85% of such branches are for procedure
    return
  • Since procedures obey a stack discipline, save
    the return address in a small buffer that acts
    like a stack; 8 to 16 entries give a small miss
    rate
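A sketch of such a return-address stack (details assumed): calls push the fall-through PC, returns pop the predicted target; with only 8 to 16 entries, deep call chains can overflow and lose the oldest addresses.

```python
class ReturnAddressStack:
    """Small LIFO buffer predicting targets of procedure-return jumps."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # Empty stack: no prediction (e.g. fall back to the BTB).
        return self.stack.pop() if self.stack else None
```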

14
Dynamic Branch Prediction Summary
  • Prediction becoming important part of scalar
    execution
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches are
    correlated with the next branch
  • Either different branches
  • Or different executions of same branch
  • Branch Target Buffer: includes branch target
    address prediction
  • Return address stack for prediction of indirect
    jumps

15
Predicated Instructions
  • Avoid branch prediction by turning branches into
    conditional or predicated instructions
  • if (x) then A = B op C else NOP
  • If false, then neither store result nor cause
    exception
  • Expanded ISA: Alpha, MIPS, PowerPC, and SPARC
    have conditional move; PA-RISC can annul any
    following instr.
  • IA-64: conditional execution of any instruction
  • Examples
  •   if (R1 == 0) R2 = R3;     =>  CMOVZ  R2,R3,R1
  •   if (R1 < R2)                  SLT    R9,R1,R2
  •     R3 = R1;                =>  CMOVNZ R3,R1,R9
  •   else                          CMOVZ  R3,R2,R9
  •     R3 = R2;
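The effect of the second example can be mimicked in Python (an illustration of the semantics, not real predication): both conditional moves are executed, and the predicate register decides which one actually writes R3.

```python
def min_predicated(r1, r2):
    """Branch-free min(r1, r2) mirroring the SLT/CMOVNZ/CMOVZ sequence."""
    r9 = int(r1 < r2)              # SLT    R9,R1,R2  -> predicate register
    r3 = 0
    r3 = r1 if r9 != 0 else r3     # CMOVNZ R3,R1,R9  -> writes only if R9 != 0
    r3 = r2 if r9 == 0 else r3     # CMOVZ  R3,R2,R9  -> writes only if R9 == 0
    return r3
```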

16
Administrivia
  • Have you all chosen a project/topic?
  • When I have received all your topics, I will
    make a schedule of the presentations.
    Presentations related to the memory hierarchy
    will be last.

17
Getting CPI < 1: Multiple-Issue Processors
  • Vector Processing: explicit coding of independent
    loops as operations on large vectors of numbers
  • Multimedia instructions being added to many
    processors
  • Multiple-Issue Processors
  • Superscalar: varying no. of instructions/cycle (1
    to 8), scheduled by compiler or by HW (Tomasulo)
    (dynamic issue capability)
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium
    III/4
  • (Very) Long Instruction Word ((V)LIW): fixed
    number of instructions (4-16) scheduled by the
    compiler (static issue capability)
  • Intel Architecture-64 (IA-64)
  • Renamed Explicitly Parallel Instruction
    Computing (EPIC)
  • Anticipated success of multiple issue led to the
    Instructions Per Cycle (IPC) metric instead of
    CPI

18
Statically Scheduled Superscalar
  • Superscalar MIPS/DLX: 2 instructions, 1 anything
    + 1 FP
  • Fetch 64 bits/clock cycle; Int on left, FP on
    right
  • Can only issue 2nd instruction if 1st
    instruction issues
  • More ports needed on the FP register file to
    execute FP load + FP op in parallel
  • Type              Pipe stages
  • Int. instruction  IF ID EX MEM WB
  • FP instruction    IF ID EX MEM WB
  • Int. instruction     IF ID EX MEM WB
  • FP instruction       IF ID EX MEM WB
  • Int. instruction        IF ID EX MEM WB
  • FP instruction          IF ID EX MEM WB
  • 1-cycle load delay expands to 3 instruction
    slots in the superscalar
  • the instruction in the right half can't use it,
    nor can the instructions in the next slot
  • Using this design, only FP loads and
    int-to-float conversion instructions can cause
    data hazards

19
Example
  • for (i = 1; i <= 1000; i++)
  •   a[i] = a[i] + s;
  • Integer instruction  FP instruction    Cycle
  • L: LD F0,0(R1)                           1
  •    LD F6,8(R1)                           2
  •    LD F10,16(R1)      ADDD F4,F0,F2      3
  •    LD F14,24(R1)      ADDD F8,F6,F2      4
  •    LD F18,32(R1)      ADDD F12,F10,F2    5
  •    SD 0(R1),F4        ADDD F16,F14,F2    6
  •    SD 8(R1),F8        ADDD F20,F18,F2    7
  •    SD 16(R1),F12                         8
  •    ADDI R1,R1,40                         9
  •    SD -16(R1),F16                       10
  •    BNE R1,R2,L                          11
  •    SD -8(R1),F20                        12

Load: 1 cycle latency; ALU op: 2 cycles latency
  • 2.4 cycles per element vs. 3.5 for ordinary DLX
    pipeline
  • Int and FP instructions not perfectly balanced

20
Dynamically Scheduled Superscalars: Issues
  • Dynamically scheduled superscalar HW potentially
    reorders instructions and sends them to correct
    execution unit. Extend Scoreboard or Tomasulo.
  • Issue packet: group of instructions from the
    fetch unit that could potentially issue in one
    cycle
  • If instruction causes structural or data hazard
    either due to earlier instruction in execution or
    to earlier instruction in issue packet, then
    instruction cannot be issued
  • 0 to N instruction issues per clock cycle, for
    N-issue
  • Performing issue checks in 1 cycle could limit
    clock cycle time: O(n^2) comparisons
  • => issue stage usually split and pipelined
  • 1st stage decides how many instructions from
    within this packet can issue, 2nd stage examines
    hazards among selected instructions and those
    already issued
  • => higher branch penalties => prediction
    accuracy important

21
Multiple Issue Issues
  • While the Integer/FP split is simple for the HW,
    we get an IPC of 2 only for programs with
  • exactly 50% FP operations AND no hazards
  • If more instructions issue at same time, greater
    difficulty of decode and issue
  • Even a 2-issue superscalar => examine 2 opcodes,
    6 register specifiers, and decide if 1 or 2
    instructions can issue (N-issue: O(N^2)
    comparisons)
  • Register file: a 2-issue superscalar needs 2x
    the reads and writes per cycle
  • Rename logic must be able to rename same
    register multiple times in one cycle! For
    instance, consider 4-way issue
  •   add r1, r2, r3        add p11, p4, p7
  •   sub r4, r1, r2   =>   sub p22, p11, p4
  •   lw  r1, 4(r4)         lw  p23, 4(p22)
  •   add r5, r1, r2        add p12, p23, p4
  • Imagine doing this transformation in a single
    cycle!
  • Result buses: need to complete multiple
    instructions/cycle
  • So, need multiple buses with associated matching
    logic at every reservation station.
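A simplified rename pass over a packet like the 4-way example above (a sketch with assumed sequential physical-register allocation rather than a real free list):

```python
def rename_packet(packet, mapping, next_phys):
    """packet: list of (op, dest, src1, src2) with architectural names.
    Returns the renamed packet; later instructions in the packet must
    see the physical registers allocated by earlier ones."""
    renamed = []
    for op, dest, src1, src2 in packet:
        p1, p2 = mapping[src1], mapping[src2]   # read current mappings first
        mapping[dest] = next_phys               # then allocate a fresh dest
        renamed.append((op, mapping[dest], p1, p2))
        next_phys += 1
    return renamed, next_phys
```

A real 4-wide renamer performs all four allocations and the intra-packet forwarding of new names in a single cycle, which is what makes this logic expensive.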

22
VLIW Processors
  • Superscalar HW too difficult to build => let the
    compiler find independent instructions and pack
    them into one Very Long Instruction Word (VLIW)
  • Example: VLIW processor with 2 ld/st units, two
    FP units, one integer/branch unit, no branch
    delay
  • Ld/st 1        Ld/st 2        FP 1             FP 2             Int
  • LD F0,0(R1)    LD F6,8(R1)
  • LD F10,16(R1)  LD F14,24(R1)
  • LD F18,32(R1)  LD F22,40(R1)  ADDD F4,F0,F2    ADDD F8,F6,F2
  • LD F26,48(R1)                 ADDD F12,F10,F2  ADDD F16,F14,F2
  •                               ADDD F20,F18,F2  ADDD F24,F22,F2
  • SD 0(R1),F4    SD 8(R1),F8    ADDD F28,F26,F2
  • SD 16(R1),F12  SD 24(R1),F16
  • SD 32(R1),F20  SD 40(R1),F24                                    ADDI R1,R1,56
  • SD -8(R1),F28                                                   BNE R1,R2,L

23
Limitations of Multiple-Issue Processors
  • Available ILP is limited (we're not programming
    with parallelism in mind)
  • Hardware cost
  • adding more functional units is easy
  • more memory ports and register ports needed
  • dependency check needs O(n^2) comparisons
  • Limitations of VLIW processors
  • Loop unrolling increases code size
  • Unfilled slots waste bits
  • Cache miss stalls pipeline
  • Research topic scheduling loads
  • Binary incompatibility (not EPIC)

24
Limits to ILP
  • Conflicting studies of the amount of available
    ILP
  • Benchmarks (vectorized Fortran FP vs. integer C
    programs)
  • Hardware sophistication
  • Compiler sophistication
  • How much ILP is available using existing
    mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep
    on processor performance curve?
  • Intel MMX, SSE (Streaming SIMD Extensions):
    64-bit ints
  • Intel SSE2: 128 bit, including 2 64-bit FP
    operations per cycle
  • Motorola AltiVec: 128-bit ints and FPs
  • SuperSPARC: multimedia ops, etc.

25
Ideal Processor
  • Assumptions for ideal/perfect processor
  • 1. Register renaming: infinite number of
    virtual registers => all register WAW & WAR
    hazards avoided
  • 2. Branch and jump prediction: perfect => all
    program instructions available for execution
  • 3. Memory-address alias analysis: addresses are
    known; a store can be moved before a load
    provided the addresses are not equal
  • Also
  • unlimited number of instructions issued/cycle
    (unlimited resources)
  • perfect caches
  • 1 cycle latency for all instructions (including
    FP multiply and divide)
  • Programs were compiled using MIPS compiler with
    maximum optimization level

26
Upper Limit to ILP Ideal Processor
[Chart: IPC for the ideal processor: integer
programs 18-60, FP programs 75-150]
27
More Realistic HWWindow Size and Branch Impact
  • Change from an infinite window to examining 2000
    instructions and issuing at most 64 instructions
    per cycle

[Chart: IPC by prediction scheme (Perfect,
Tournament, BHT(512), Profile, No prediction):
FP 15-45, integer 6-12]
28
More Realistic HW: Impact of Limited Renaming
Registers
  • Changes: 2000 instr. window, 64 instr. issue, 8K
    2-level predictor (slightly better than the
    tournament predictor)

[Chart: IPC vs. number of renaming registers
(Infinite, 256, 128, 64, 32): FP 11-45,
integer 5-15]
29
More Realistic HW Memory Address Alias Impact
  • Changes: 2000 instr. window, 64 instr. issue, 8K
    2-level predictor, 256 renaming registers

[Chart: IPC vs. alias analysis (Perfect,
Global/stack perfect, Inspection, None):
FP 4-45 (Fortran, no heap), integer 4-9]
30
Realistic HW for '00: Window Size Impact
  • Assumptions: perfect disambiguation, 1K
    selective predictor, 16-entry return stack, 64
    renaming registers, issue as many as the window
    allows

[Chart: IPC vs. window size (Infinite, 256, 128,
64, 32, 16, 8, 4): FP 8-45, integer 6-12]
31
How to Exceed ILP Limits of This Study?
  • WAR and WAW hazards through memory: the study
    eliminated WAW and WAR hazards through register
    renaming, but not through memory
  • Unnecessary dependences (the compiler did not
    unroll loops, so there is an iteration-variable
    dependence)
  • Overcoming the data-flow limit: value prediction,
    predicting values and speculating on the
    prediction
  • Address value prediction and speculation:
    predicts addresses and speculates by reordering
    loads and stores; could provide better aliasing
    analysis

32
Workstation Microprocessors 3/2001
  • Max issue: 4 instructions (many CPUs)
  • Max rename registers: 128 (Pentium 4)
  • Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra
    III)
  • Max window size (OOO): 126 instructions (Pentium
    4)
  • Max pipeline: 22/24 stages (Pentium 4)


Source Microprocessor Report, www.MPRonline.com
33
SPEC 2000 Performance 3/2001 Source
Microprocessor Report, www.MPRonline.com
34
Conclusions
  • 1985-2000: 1000X performance
  • Moore's Law transistors/chip => Moore's Law for
    performance/CPU
  • Hennessy: industry has been following a roadmap
    of ideas known in 1985 to exploit Instruction
    Level Parallelism and (real) Moore's Law to get
    1.55X/year
  • Caches, pipelining, superscalar, branch
    prediction, out-of-order execution, ...
  • ILP limits: to make performance progress in the
    future, need explicit parallelism from the
    programmer vs. implicit parallelism of ILP
    exploited by compiler and HW?
  • Otherwise drop to the old rate of 1.3X per year?
  • Less than 1.3X because of the processor-memory
    performance gap?
  • Impact on you: if you care about performance,
    better to think about explicitly parallel
    algorithms vs. relying on ILP?

35
Tournament Predictors
  • Motivation for correlating branch predictors:
    the 2-bit predictor failed on important
    branches; by adding global information,
    performance improved
  • Tournament predictors: use 2 predictors, 1 based
    on global information and 1 based on local
    information, and combine them with a selector
  • Hope: select the right predictor for the right
    branch

36
Tournament Predictor in Alpha 21264
  • 4K 2-bit counters choose between a global
    predictor and a local predictor
  • Global predictor also has 4K entries and is
    indexed by the history of the last 12 branches;
    each entry in the global predictor is a standard
    2-bit predictor
  • 12-bit pattern: ith bit = 0 => ith prior branch
    not taken; ith bit = 1 => ith prior branch taken
  • Local predictor consists of a 2-level predictor
  • Top level: a local history table of 1024 10-bit
    entries; each 10-bit entry records the most
    recent 10 branch outcomes for that entry. The
    10-bit history allows patterns of up to 10
    branches to be discovered and predicted.
  • Next level: the selected entry from the local
    history table indexes a table of 1K entries
    consisting of 3-bit saturating counters, which
    provide the local prediction
  • Total size: 4K x 2 + 4K x 2 + 1K x 10 + 1K x 3 =
    29K bits!
  • (about 180,000 transistors)
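The chooser can be sketched as follows (an illustration with assumed interfaces, far smaller than the 21264's tables): a 2-bit counter per entry is trained toward whichever component predictor was right when the two disagree.

```python
class TournamentPredictor:
    """Selects per branch between a global and a local predictor."""

    def __init__(self, global_pred, local_pred, entries=4096):
        self.gp, self.lp = global_pred, local_pred
        self.choice = [2] * entries        # 2-bit chooser; >= 2 favours global
        self.mask = entries - 1

    def predict(self, pc):
        if self.choice[pc & self.mask] >= 2:
            return self.gp.predict(pc)
        return self.lp.predict(pc)

    def update(self, pc, taken):
        i = pc & self.mask
        g, l = self.gp.predict(pc), self.lp.predict(pc)
        if g != l:                          # train chooser only on disagreement
            if g == taken:
                self.choice[i] = min(3, self.choice[i] + 1)
            else:
                self.choice[i] = max(0, self.choice[i] - 1)
        self.gp.update(pc, taken)           # both components always train
        self.lp.update(pc, taken)
```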

37
Percentage of predictions from the local predictor
in the Tournament Prediction Scheme
38
Accuracy of Branch Prediction
  • Profile: branch profile from the last execution
    (static in that it is encoded in the
    instruction, but based on a profile)

39
Accuracy v. Size (SPEC89)