Superscalar Techniques
1
Chapter 5
  • Superscalar Techniques

2
3 major issues
  • Instruction flow
  • Register data flow
  • Memory data flow
  • Major challenges
  • Branch instruction processing
  • ALU instructions
  • Load/store instructions

3
Instruction Flow
  • Control dependences
  • Control Flow Graph (CFG)
  • Nodes - instructions
  • Edges - control flow
  • Branches
  • Jumps
  • Calls, Returns

4
Program Control Flow
5
Performance Degradation due to Branches
  • In a scalar processor, the branch penalty may be 3 cycles
  • In an N-wide superscalar processor, the same penalty
    costs 3N instruction slots
  • Conditional branches raise 2 issues
  • Condition resolution
  • Target address generation

6
Branch Penalty
  • Unconditional branches
  • Only the target needs to be resolved
  • But the penalty depends on the addressing mode of the
    branch
  • PC-relative
  • Indirect branches

7
Disruption of Sequential Control Flow by Branch
Instructions
  • 3-cycle penalty for
  • Conditional branches
  • (D, Dis, and Ex stages)
  • If 4-way superscalar,
  • Penalty equivalent to
  • 4 × 3 = 12 instruction slots
  • For conditional branches,
  • Bottleneck could be
  • Condition resolution or
  • Target address calculation

8
Branch Target Address Generation Penalties
  • Target address calculation penalty varies for
    different types of branches
  • PC-relative - 1 cycle
  • Register indirect - 2 cycles
  • Register indirect with offset - 3 cycles

9
Branch Condition Resolution Penalties
  • Different for different types of branches
  • If CC (condition code) is used - 2 cycles
  • If a GP register must be compared - 3 cycles

10
Static Branch Prediction
  • No hardware to predict branch at run time
  • Always Not Taken
  • Backwards branch always taken
  • Compiler assisted branch prediction
  • Depending on opcode
  • Calls
  • Returns
  • Loops

11
Dynamic Branch Prediction
  • Static branch prediction gets 70-80% correctness
  • Dynamic branch predictors achieve 80-95%
    correctness in branch direction prediction
  • Problems with static prediction
  • A branch which is taken during first half of
    program and not taken during second half

12
FSM Model
  • History based FSMs

13
Branch Prediction Techniques
  • Branch target speculation
  • Branch condition speculation
  • Branch target speculation
  • BTB (branch target buffer)
  • Branch instruction address (BIA)
  • Branch target address (BTA)
  • BTB is like a cache
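As a concrete illustration, the BTB lookup can be sketched in C as below. This is a minimal sketch assuming a direct-mapped organization; the entry count and function names are illustrative, and real BTBs are typically set-associative:

    #include <stdint.h>
    #include <stdbool.h>

    #define BTB_ENTRIES 64    /* illustrative size */

    typedef struct {
        bool     valid;
        uint32_t bia;         /* branch instruction address (the tag) */
        uint32_t bta;         /* branch target address */
    } BtbEntry;

    static BtbEntry btb[BTB_ENTRIES];

    /* On a hit, the fetch unit can redirect to the target without a bubble. */
    bool btb_lookup(uint32_t fetch_pc, uint32_t *target) {
        BtbEntry *e = &btb[(fetch_pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->bia == fetch_pc) { *target = e->bta; return true; }
        return false;
    }

    /* Install or refresh an entry when a taken branch resolves. */
    void btb_update(uint32_t pc, uint32_t target) {
        BtbEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        e->valid = true; e->bia = pc; e->bta = target;
    }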

14
Branch Target Speculation
15
Branch Condition Speculation
  • Hardware to predict Taken/Not-Taken
  • Always predict Not Taken
  • ~50% of branches are taken (T)
  • Is predicted taken equal to predicted NT?
  • No, because taken branches will have some
    additional penalty to start fetching instructions
    from the new target

16
Branch Condition Speculation
  • Let us say 3 cycles to fill the pipe with
    instructions from the new target
  • Always not taken - no stalls per correct
    prediction
  • Always taken - 3 stalls per correct prediction
  • Assume 4 cycles per wrong prediction
  • Assuming 20% branches,
  • CPI_1 = 1 + 0.05 × 4 = 1.2 (always not taken)
  • CPI_2 = 1 + 0.05 × 4 + 0.05 × 3 = 1.35 (always taken)

17
History-based Branch Prediction
  • 2-bit branch predictor
  • If count is 2 or 3, predict taken
  • If count is 0 or 1, predict not taken
  • A 2-bit counter for each branch?
  • An array of 2-bit counters
  • How to access this array?
  • How to index into it?
  • Organized like a cache
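A minimal C sketch of such an array of 2-bit saturating counters, indexed cache-style by low-order PC bits (the table size and function names are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define PHT_SIZE 1024                 /* illustrative; power of two */

    static uint8_t pht[PHT_SIZE];         /* 2-bit counters, values 0..3 */

    /* Index like a cache: low-order bits of the (word-aligned) PC. */
    static unsigned pht_index(uint32_t pc) { return (pc >> 2) & (PHT_SIZE - 1); }

    /* Predict taken when the count is 2 or 3. */
    bool predict_taken(uint32_t pc) { return pht[pht_index(pc)] >= 2; }

    /* Update on the actual outcome, saturating at 0 and 3. */
    void train(uint32_t pc, bool taken) {
        uint8_t *c = &pht[pht_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }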

18
History-based Branch Prediction
  • 2-bit branch predictor

19
Optimal 2-bit Branch Predictors
Counter - a 2-bit saturating counter
20
Two Aspects of Branch Prediction (a) Branch
Speculation (b) Branch Validation/Recovery
21
Branch Prediction in PowerPC 604
  • Instead of a single unified BTB, two structures
  • BTAC - 64 entries, FA
  • BHT - 512 entries, DM
  • FA (Fetch Address) sent to both
  • BTAC - 1 cycle; BHT - 2 cycles
22
PowerPC 604 contd
  • BTAC - 64 entries, FA; BHT - 512 entries, DM
  • BTAC - 1 cycle; BHT - 2 cycles
  • 4 entries in the Branch Reservation Station
  • Up to 4 speculative branch instructions
  • 2-bit tag used to identify speculative instructions
  • After a branch resolves, speculative instructions are
    made non-speculative or invalidated
23
Two-level Branch Prediction of Yeh and Patt
24
Correlated Branch Predictor with Global BHSR
25
Correlated Branch Predictor with Individual BHSRs
26
gshare Correlated Branch Predictor
  • Scott McFarling
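gshare XORs the branch address with a global branch history register to index one shared PHT of 2-bit counters. A C sketch, where the history length, table size, and names are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define HIST_BITS 12
    #define PHT_SIZE  (1u << HIST_BITS)

    static uint16_t ghr;                  /* global branch history register */
    static uint8_t  pht[PHT_SIZE];        /* shared 2-bit saturating counters */

    /* XOR folds the global history into the PC-based index. */
    static unsigned gshare_index(uint32_t pc) {
        return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
    }

    bool gshare_predict(uint32_t pc) { return pht[gshare_index(pc)] >= 2; }

    void gshare_train(uint32_t pc, bool taken) {
        uint8_t *c = &pht[gshare_index(pc)];
        if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
        ghr = (uint16_t)(((ghr << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1));
    }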

27
Two-level predictors (contd)
  • BHSR: G - global, P - individual
  • PHT: g - global, p - individual, s - shared
  • A - adaptive
  • 3 predictors with 97% prediction accuracy
  • GAg - 1 BHSR (18 bits), 1 PHT of 2^18 × 2 bits
  • PAg - 512 × 4-way SA BHSRs of 12 bits, 1 PHT of
    2^12 × 2 bits
  • PAs - 512 × 4-way BHSRs of 6 bits, 512 PHTs of 2^6
    × 2 bits (each PHT is DM)

28
Two-level predictors (contd)
  • 2-level predictors give 95% accuracy
  • Traditional predictors give 90%
  • Processors from the Pentium Pro and AMD Nx686
    onwards use 2-level predictors

29
3 major issues
  • Instruction flow
  • Register data flow
  • Memory data flow
  • Major challenges
  • Branch instruction processing
  • ALU instructions
  • Load/store instructions

30
Register Data Flow Techniques
  • Efficient execution of ALU type instructions in
    the execution core
  • This is the real work
  • Memory and control flow instructions are viewed as
    supporting
  • Memory instructions provide data
  • Control flow instructions provide the right
    instructions
  • Concept of useful and overhead instructions

31
ALU Instructions
  • Ri <- Fn(Rj, Rk)
  • Source registers - Rj, Rk
  • Destination register - Ri
  • Operation - Fn
  • If source operands in Rj or Rk are not available
    - true data dependence
  • If destination register Ri is not available - anti- or
    output dependence

32
Register Reuse and False Dependencies
  • Anti and output dependence are due to register
    reuse
  • Called false dependencies
  • Register reuse is also called register recycling
  • Compiler performs code generation and register
    allocation

33
Register Allocation
  • Single assignment code - code in which each
    symbolic register stores one value and is
    written only once
  • Practical ISAs have limited number of registers
  • Register coloring algorithm
  • Register live-range

34
Register Renaming
  • Dynamically assign different names to the
    multiple definitions of an architected register
  • (a) before renaming    (b) after renaming
  • r4 <- r3 + r2          r4 <- r3 + r2
  • r6 <- r4 + r5          r6 <- r4 + r5
  • r4 <- r6 + r7          r8 <- r6 + r7
  • In (b), the second write to r4 is renamed to r8,
    removing the output (WAW) and anti (WAR) dependences on r4

35
Register Renaming
  • Single assignment is effectively done for
    instructions that are in flight
  • Eliminate all false dependences (anti and output)
    between the instructions in flight
  • Common techniques
  • Rename Register File (RRF) + Architected Register
    File (ARF) + mapping table
  • Reorder Buffer (ROB)

36
Rename Register File (RRF)
37
Register Renaming Tasks
  • Source Read
  • For operand fetch - occurs during decode/dispatch
    stage
  • Destination allocate
  • During decode - subtasks are: set busy bit, assign
    tag, update map table
  • Register update
  • At the end of execution, update the RRF first and
    then the ARF

38
Source Read - 3 possibilities
  • Busy bit in ARF not set
  • ARF contains the operand
  • Busy bit in ARF set
  • There is a pending write to the ARF register; the
    content of the ARF is stale; the map table is used to
    get the RRF tag to index into the RRF
  • Valid bit of RRF set - source operand is in the RRF
  • Valid bit of RRF not set - RRF has a pending
    update; the tag is forwarded to the reservation station
    instead of the source operand; the RS will get the data
    later by forwarding
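The three possibilities can be sketched as a single lookup routine in C. For brevity this sketch folds the map-table tag into the ARF entry; all field and function names are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint64_t data; bool busy; uint8_t tag; } ArfEntry; /* tag = RRF index */
    typedef struct { uint64_t data; bool valid; } RrfEntry;

    /* Returns true with *value if the operand is available now; otherwise
     * returns false with *tag so the RS can wait for the forwarded result. */
    bool source_read(const ArfEntry *arf, const RrfEntry *rrf, int reg,
                     uint64_t *value, uint8_t *tag) {
        if (!arf[reg].busy) {            /* case 1: ARF holds the operand */
            *value = arf[reg].data;
            return true;
        }
        uint8_t t = arf[reg].tag;        /* pending write: map table gives RRF tag */
        if (rrf[t].valid) {              /* case 2: RRF already has the result */
            *value = rrf[t].data;
            return true;
        }
        *tag = t;                        /* case 3: still pending; wait on the bus */
        return false;
    }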

39
Register Renaming Tasks
40
Integrating RRF and ARF
  • Although the discussion so far assumed separate RRF
    and ARF, separation is not required
  • A single register file with number of entries
    equal to RRF + ARF is sufficient
  • Pooled register file
  • Each physical register can be flexibly assigned
    to be an AR (architected register) or RR (rename
    register)
  • With a pooled register file there is no need to copy
    the result for the final update

41
Floating-Point Unit (FPU) Register Renaming
  • Example for Pooled Register File

42
Pooled Register File in IBM RS/6000
  • 40 physical registers, 32 architected or logical
    registers
  • Mapping table contains 32 entries, each 6 bits wide
  • The 6 bits specify the physical register
  • Rename pipe stage contains the map table, two
    circular queues, and control logic
  • Map table must have 4 ports because fused multiply-add
    reads 3 source registers (plus 1 destination)

43
Pooled Register File in IBM RS/6000
  • First queue - Free list (FL)
  • Second queue - Pending target return queue (PTRQ)
  • FL contains registers available for renaming
  • PTRQ contains registers already in use for
    renaming
  • Figure shows initial condition with PTRQ empty

44
Pooled Register File in IBM RS/6000
  • Map table contains the latest mapping of each
    logical register.
  • When a new instruction needs the register as
    destination, current entry of map table is pushed
    into PTRQ
  • Subsequent instructions that need this register
    as source will receive the new physical register
    specifier as the source

45
True Data Dependencies and the Data flow limit
  • RAW dependencies cannot be eliminated by renaming
  • Producer-consumer relationship between the 2
    instructions
  • Imposes serialization between 2 dependent
    instructions
  • Data dependence graph (DDG) used to represent
    such true dependences

46
FFT Code Fragment
47
Data Flow Graph of the Code Fragment
48
DDG
  • Instructions are nodes
  • Edges are dependences
  • Edges can be marked with latencies
  • Critical path of a DFG - the longest dependence chain,
    measured in terms of total cumulative latency
  • Data flow limit
  • 12 cycles for the FFT example

49
Tomasulo's Algorithm
  • IBM 360/91 FP unit had dynamic scheduling
  • Tomasulo, 1967
  • Several contemporary superscalar out-of-order
    processors draw many ideas from the IBM 360/91

50
Original Design of the IBM 360/91 FPU
51
Modified Design of the IBM 360/91 FPU
52
Use of Tag Fields
53
Example Instruction Sequence
  • W: R4 <- R0 + R8
  • X: R2 <- R0 * R4
  • Y: R4 <- R4 + R8
  • Z: R8 <- R4 * R2
  • True dependences (RAW): W->X and W->Y on R4, Y->Z on R4,
    X->Z on R2; false: Y's write to R4 creates WAR with X and
    WAW with W

54
Tomasulo's Algorithm
W: R4 <- R0 + R8; X: R2 <- R0 * R4; Y: R4 <- R4 + R8; Z: R8 <- R4 * R2
R0 = 6.0, R8 = 7.8
W finishes @2; result broadcast, not written into R4
55
Tomasulo's Algorithm (contd)
W: R4 <- R0 + R8; X: R2 <- R0 * R4; Y: R4 <- R4 + R8; Z: R8 <- R4 * R2
R0 = 6.0, R8 = 7.8
@4 Y finishes, updates R4
@5 X finishes, updates R2
Z starts @6
56
Data Flow Graphs of Example Instruction Sequence
W: R4 <- R0 + R8; X: R2 <- R0 * R4; Y: R4 <- R4 + R8; Z: R8 <- R4 * R2
57
Dynamic Execution Core
  • The out-of-order execution core is also called the
    dynamic execution core
  • Tries to achieve the data flow limit
  • 3 steps in dynamic execution
  • Instruction dispatching
  • Instruction execution
  • Instruction completion

58
Instruction Dispatch Phase
  • Rename destination registers
  • Allocate reservation station and ROB entries
  • Advance instructions from the dispatch buffer to
    the reservation stations (RS)
  • ROB entries allocated in program order
  • Rename register, RS entry, and ROB entry must be
    available to be able to dispatch an instruction

59
Instruction Execution Phase
  • Issue Ready Instructions
  • Execute issued instructions
  • Forward results
  • RS is responsible for identifying ready
    instructions
  • Ready means all source operands are available
  • Waiting instructions continually monitor the result
    buses for operands

60
Micro-Dataflow Engine for dynamic Execution
61
Instruction Execution Phase contd
  • Monitor the buses for operands using tags
  • Result buses come with results and tags
  • When tag matches, result captured from result bus
    into RS entry
  • The result also goes into the register file
  • When all operands are available, the instruction is
    ready for issue
  • If multiple instructions are ready, a scheduling
    algorithm picks one (e.g., oldest first, most critical
    first)

62
Instruction Execution Phase contd
  • FUs have varying latencies
  • Single cycle latency FU
  • Multi-cycle (fixed) latency FU
  • Multi-cycle variable latency
  • When instruction finishes, destination tag and
    result broadcast on result bus
  • Result bus also called forwarding bus
  • Tag is specifier of the rename register assigned
    for destination of this instruction

63
Instruction Execution Phase contd
  • When instruction finishes, destination tag and
    result broadcast on result bus
  • All dependent instructions waiting in the RS
    will trigger a tag match and will latch in the
    broadcasted result
  • This forwarding avoids writing the result into a
    register and reading it back from there
  • The destination tag is also used to update RRF

64
Instruction Execution Phase contd
  • RS entry usually deallocated when instruction
    issued
  • Another trailing instruction can now be
    dispatched into the RS
  • RS saturation can cause instruction stalls
  • RS helps to achieve data flow limit
  • RS helps to eliminate WAR dependency if it copies
    operands

65
Reservation Station + ROB
  • RS and ROB are critical components of out-of-order
    execution
  • Issues associated with the management of these
    components determine the efficiency of superscalar
    execution
  • Loading and unloading entries of RS and ROB
    should be managed well

66
Reservation Station Structure
  • 3 tasks of the RS - dispatching, waiting, and issuing
  • RS fields
  • Operand 1, Valid bit for Operand 1
  • Operand 2, Valid bit for Operand 2
  • Busy bit for the entire RS entry
  • Ready bit to indicate the entry can be issued (all
    source operands available)
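These fields map naturally onto a struct. A C sketch with illustrative field widths:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     busy;       /* entry is allocated */
        uint8_t  op;         /* operation to perform */
        uint64_t op1, op2;   /* source operands (if already captured) */
        bool     v1, v2;     /* valid bits: operand present vs. waiting on a tag */
        uint8_t  tag1, tag2; /* rename tags to match against the result buses */
        uint8_t  dest_tag;   /* rename register allotted to the destination */
        bool     ready;      /* v1 && v2: eligible for issue */
    } RsEntry;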

67
Reservation Station Mechanisms
68
Reservation Station contd
  • Dispatching into the RS - 3 steps
  • Select a free (i.e., not busy) RS entry
  • Load operands/tags into the selected entry
  • Set the busy bit of that entry
  • Instructions with pending operands are not ready
  • Tag match occurs and instruction receives all
    operands -> instruction wake-up

69
Wake up and Select Logic
  • Wake up logic checks for tag match and sets ready
    bit of instructions when all operands received
  • Associative operation involved because a tag on
    the bus needs to be compared against all
    instructions waiting in RS
  • Select logic selects an instruction to be issued
  • Scheduling heuristic
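A C sketch of the associative wake-up and a simple select, reusing the RsEntry struct sketched under slide 66 (the array size and the oldest-first approximation are illustrative assumptions):

    #define RS_ENTRIES 16

    /* Wake-up: broadcast a completing instruction's tag and result to all entries. */
    void wakeup(RsEntry rs[RS_ENTRIES], uint8_t tag, uint64_t result) {
        for (int i = 0; i < RS_ENTRIES; i++) {   /* associative compare */
            if (!rs[i].busy) continue;
            if (!rs[i].v1 && rs[i].tag1 == tag) { rs[i].op1 = result; rs[i].v1 = true; }
            if (!rs[i].v2 && rs[i].tag2 == tag) { rs[i].op2 = result; rs[i].v2 = true; }
            rs[i].ready = rs[i].v1 && rs[i].v2;
        }
    }

    /* Select: pick one ready entry per cycle. Lowest index stands in for
     * "oldest" here; real hardware tracks instruction age explicitly. */
    int select_ready(RsEntry rs[RS_ENTRIES]) {
        for (int i = 0; i < RS_ENTRIES; i++)
            if (rs[i].busy && rs[i].ready) return i;
        return -1;                               /* nothing ready this cycle */
    }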

70
ROB design issues
  • ROB contains all instructions in flight
  • Does the RS contain all instructions in flight? (No -
    RS entries are freed at issue)
  • Each instruction can be waiting for execution, in
    execution, or waiting for completion after execution
  • Status bits indicate these states
  • A bit indicates whether the instruction is on a
    speculative path
  • When the branch resolves, speculative ->
    non-speculative
  • Only non-speculative instructions can be retired

71
ROB organization
  • ROB fields
  • ROB managed as a circular queue
  • Head pointer and tail pointer
  • Tail pointer advanced when ROB entries are allocated
    at dispatch
  • Dispatch bandwidth = number of entries allocated
    per cycle
  • Instructions completed from the head of the queue
  • Completion bandwidth = number retired per cycle
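A C sketch of the circular-queue management (entry fields and size are illustrative; a real ROB allocates and retires several entries per cycle):

    #include <stdbool.h>

    #define ROB_ENTRIES 32

    typedef struct { bool busy, finished, speculative; int dest; } RobEntry;

    typedef struct {
        RobEntry e[ROB_ENTRIES];
        int head, tail, count;  /* head: next to complete; tail: next free slot */
    } Rob;

    /* Dispatch: allocate at the tail, in program order; -1 means stall. */
    int rob_alloc(Rob *r) {
        if (r->count == ROB_ENTRIES) return -1;  /* ROB full */
        int idx = r->tail;
        r->e[idx] = (RobEntry){ .busy = true };
        r->tail = (r->tail + 1) % ROB_ENTRIES;
        r->count++;
        return idx;
    }

    /* Complete from the head only, and only finished, non-speculative entries. */
    bool rob_complete(Rob *r) {
        RobEntry *h = &r->e[r->head];
        if (r->count == 0 || !h->finished || h->speculative) return false;
        h->busy = false;
        r->head = (r->head + 1) % ROB_ENTRIES;
        r->count--;
        return true;
    }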

72
Reorder Buffer Entry and Org
73
ROB issues contd
  • Completion bandwidth is determined by the routing
    network and the ports available for register
    writeback
  • Data copying from ROB/RRF to ARF
  • RS + ROB can be one structure called the RUU
    (Register Update Unit) or instruction window

74
Dynamic Instruction Scheduler
  • Dynamic scheduling involves the instruction window
    (RS + ROB), wake-up and select logic
  • Instruction scheduler with data capture
  • Scheduler without data capture

75
Instruction Scheduling without and with data
capture
  • (a) with data capture, (b) without data capture

76
Dynamic Instruction Scheduler
  • Instruction scheduler with data capture
  • RS copies operands or tags at dispatch
  • Scheduler without data capture
  • no copying of operands, only tags
  • Scheduler performs tag match to wake up ready
    instructions
  • Operands obtained from RF just prior to execution
  • Many new processors do it this way

77
Other Register Data Flow techniques
  • Is data flow limit fundamental?
  • Value prediction
  • Lipasti, Wilkerson, Shen
  • Predict load values
  • Values loaded by many load instructions are quite
    predictable
  • Value locality

78
Memory Data Flow techniques
  • Not all data can be kept in registers
  • Compiler spill code leads to load/store instructions
  • Dynamic scheduling (out-of-order issue) of load and
    store instructions is important
  • Loads have long latency
  • A load that misses in the cache should not block
    a later load that could proceed

79
Memory Accessing Instructions
  • Steps in memory instructions
  • Memory address generation
  • Address is not in the instruction
  • Address computed as reg + offset
  • Memory address translation
  • To support Virtual memory
  • Memory sharing and protection issues
  • Data memory accessing

80
Processing of Load/Store Instructions
81
Load/Store Pipes
  • 3 stages - address generation, address translation,
    memory access
  • Look at Fig 5-30, L/S Unit
  • First pipe stage - add register with offset
  • Second pipe stage - TLB access; if TLB miss, page
    table access, possibly even a page fault
  • Page fault typically handled as an exception

82
Load/Store Pipes
  • 3rd pipe stage
  • Load accesses data memory
  • Cache miss possible
  • Store instruction can be considered finished
    at the end of the second stage
  • Data in register or ROB is moved to the store buffer
  • Store buffer is a FIFO buffer
  • Store instruction can be architecturally complete
    but not yet retired to memory
  • Only non-speculative stores are retired
  • When an exception occurs, stores before the exception
    are retired; the rest are flushed

83
Ordering of memory accesses - L/S dependencies
  • RAW
  • WAR
  • RAR
  • WAW
  • These memory dependencies must be enforced for
    program correctness

84
Ordering of memory access
  • Total ordering of loads and stores is safe but
    not required
  • Total ordering is very conservative
  • Independent loads could be allowed to go ahead of
    pending stores
  • A load might be stuck on a cache miss; later loads
    could be allowed to proceed
  • If a load reads from an address with a pending earlier
    store, the load cannot go ahead of that store
  • The load could be serviced from the store buffer if the
    addresses are known

85
Memory aliasing
  • If a load and a store refer to the same memory
    location, there is an aliasing or collision
  • Consider
  • store R4, 100(R3)
  • ...
  • load R6, 200(R2)
  • Is the load independent of the store?

86
Memory aliasing
  • store R4, 100(R3)
  • ...
  • load R6, 200(R2)
  • Is the load independent of the store?
  • What if (R3) = 300 and (R2) = 200? Then both access
    address 400 - a collision
  • Cannot be sure of independence until the addresses
    are calculated

87
DAXPY Example
  • From LINPACK benchmark
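DAXPY computes Y = a*X + Y over double-precision vectors; a plain C rendering:

    /* DAXPY: y[i] = a * x[i] + y[i] for i = 0..n-1. Each iteration does
     * two loads (x[i], y[i]), a multiply-add, and one store (y[i]). */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }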

88
DAXPY Example
  • Can you reorder loads and stores here?

89
DAXPY dependencies
  • Dependencies inside an iteration
  • Dependencies between iterations
  • Load instructions from a future iteration could
    go ahead of store instructions from current
    iteration
  • Loads could be allowed to go OOO without too much
    difficulty
  • Stores are usually never allowed to go OOO

90
Load Bypassing and Load Forwarding
  • Load bypassing - allow a load instruction to jump
    ahead of preceding store instructions if
    the load address does not alias with the
    preceding stores, i.e., no memory dependence with a
    preceding store
  • Load forwarding - if a trailing load aliases with a
    preceding store, the load can receive its data
    directly from the store via forwarding
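A C sketch of the store-buffer check that decides between the two. It assumes word-granularity addresses that either match exactly or not at all; the structure and names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define SB_ENTRIES 8

    typedef struct { bool valid; uint32_t addr; uint64_t data; } SbEntry;

    typedef struct {
        SbEntry e[SB_ENTRIES];
        int head, count;        /* FIFO: head is the oldest pending store */
    } StoreBuffer;

    /* Scan pending stores from youngest to oldest.
     * Match    -> load forwarding: take the data from the store buffer.
     * No match -> load bypassing: the load may go ahead to the cache. */
    bool load_check(const StoreBuffer *sb, uint32_t addr, uint64_t *data) {
        for (int i = sb->count - 1; i >= 0; i--) {
            const SbEntry *s = &sb->e[(sb->head + i) % SB_ENTRIES];
            if (s->valid && s->addr == addr) { *data = s->data; return true; }
        }
        return false;           /* no alias with any pending store */
    }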

91
Early Execution of Load Instructions
  • (a) Load bypassing (b) Load Forwarding

92
Mechanisms for Load/Store Processing
93
Illustration of Load Bypassing
94
Illustration of Load Forwarding
95
Fully Out-of-Order Issuing and Execution
96
Memory dependence prediction
  • Memory dependence checking
  • Store p: store R4, 100(R3)
  • Load q: load R5, 200(R2)
  • If the addresses are unknown, conservatively we just
    assume that p == q (they alias)
  • Other option - speculate that p != q and
    proceed; if the prediction later proves wrong,
    correct the misspeculation

97
Memory dependence prediction
  • Memory disambiguation
  • Disambiguate the addresses
  • Memory disambiguation techniques can make a big
    difference in performance
  • When multiple issue is combined with memory
    dependence issues, schemes can be very complex

98
Multiported memories
  • Superscalar means multiple instruction issue,
    hence multiple loads could occur in the same
    cycle
  • High cache and memory bandwidth is required to
    support an aggressive processor
  • Multiple ports on caches help
  • Multiple load/store pipes are required

99
Non blocking memories
  • When a cache miss occurs, should following hits be
    serviced?
  • Will the cache freeze on a miss?
  • Blocking vs. non-blocking caches
  • Superscalar execution needs non-blocking caches,
    otherwise performance is poor
  • lw1 ----- cache miss
  • lw2 ----- will hit in cache
  • lw3 ----- will hit in cache

100
MSHRs
  • MSHR - Miss Status Holding Register
  • The hardware support needed for handling
    non-blocking misses
  • 4 MSHRs can allow up to 4 outstanding
    misses
  • Needs a missed-load queue
  • The missed-load queue holds the missing load; when the
    data arrives, the load exits the queue and finishes
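A C sketch of MSHR bookkeeping for four outstanding misses (fields and names are illustrative; real MSHRs also merge secondary misses to the same block):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_MSHR 4            /* up to 4 outstanding misses */

    typedef struct {
        bool     valid;           /* miss outstanding */
        uint32_t block_addr;      /* missing cache block */
        int      dest_tag;        /* rename register waiting for the data */
    } Mshr;

    static Mshr mshr[NUM_MSHR];

    /* On a miss: grab a free MSHR so later hits can keep being serviced.
     * If none is free, the cache must block (structural stall). */
    int mshr_alloc(uint32_t block_addr, int dest_tag) {
        for (int i = 0; i < NUM_MSHR; i++) {
            if (!mshr[i].valid) {
                mshr[i] = (Mshr){ true, block_addr, dest_tag };
                return i;
            }
        }
        return -1;
    }

    /* When the fill returns, the missed load exits the missed-load queue
     * and finishes; the MSHR is freed. */
    void mshr_fill(int i) { mshr[i].valid = false; }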

101
Dual-ported and Nonblocking Data Cache
  • Dual ported and non-blocking

102
Prefetching
  • Hardware and Software Prefetching
  • Prefetching Cache
  • Anticipates future misses and triggers these
    misses early, in the hope of bringing in the data
    before the actual load happens
  • Memory reference prediction table
  • Prefetch queue

103
Software Prefetching
  • Compiler inserts prefetch instructions to
    trigger prefetching of data into the cache early
  • The actual load instruction will hit if the prefetch
    completes in time
  • Loop Unrolling
  • Load hoisting
  • Software pipelining
  • Explicit Software Prefetching
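A C sketch of explicit software prefetching using the GCC/Clang __builtin_prefetch intrinsic (the prefetch distance of 16 iterations is an illustrative tuning parameter):

    /* Fetch a[i+16] into the cache while working on a[i], so that by the
     * time the loop reaches it, the load hits. */
    double sum_with_prefetch(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read; low temporal locality */
            s += a[i];
        }
        return s;
    }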

104
Prefetching Data Cache