1
Advanced Microarchitecture
Prof. Mikko H. Lipasti, University of Wisconsin-Madison
Lecture notes based on notes by Ilhyun Kim, updated by Mikko Lipasti
2
Outline
  • Instruction scheduling overview
  • Scheduling atomicity
  • Speculative scheduling
  • Scheduling recovery
  • Complexity-effective instruction scheduling
    techniques
  • CRIB reading
  • Scalable load/store handling
  • NoSQ reading
  • Building large instruction windows
  • Runahead, CFP, iCFP
  • Control Independence
  • 3D die stacking

3
Readings
  • Read on your own:
  • Shen & Lipasti, Chapter 10 on Advanced Register
    Data Flow (skim)
  • I. Kim and M. Lipasti, "Understanding Scheduling
    Replay Schemes," in Proceedings of the 10th
    International Symposium on High-Performance
    Computer Architecture (HPCA-10), February 2004.
  • Srikanth Srinivasan, Ravi Rajwar, Haitham
    Akkary, Amit Gandhi, and Mike Upton, "Continual
    Flow Pipelines," in Proceedings of ASPLOS 2004,
    October 2004.
  • Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric
    Rotenberg, Haitham H. Akkary, "Transparent
    Control Independence," in Proceedings of ISCA-34,
    2007.
  • To be discussed in class:
  • T. Sha, M. Martin, A. Roth, "NoSQ: Store-Load
    Communication without a Store Queue," in
    Proceedings of the 39th Annual IEEE/ACM
    International Symposium on Microarchitecture,
    2006.
  • Erika Gunadi, Mikko Lipasti, "CRIB: Combined
    Rename, Issue, and Bypass," ISCA 2011.
  • Andrew Hilton, Amir Roth, "BOLT: Energy-efficient
    Out-of-Order Latency-Tolerant execution,"
    Proceedings of HPCA 2010.
  • Loh, G. H., Xie, Y., and Black, B. 2007.
    Processor Design in 3D Die-Stacking Technologies.
    IEEE Micro 27, 3 (May 2007), 31-48.

4
Register Dataflow
5
Instruction scheduling
  • A process of mapping a series of instructions
    onto execution resources
  • Decides when and where an instruction is executed
  • Example: a data dependence graph mapped to two FUs
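
As a rough illustration of "when and where", the sketch below greedily maps a small dependence graph onto two functional units. The graph, the single-cycle latency, and the oldest-first priority are assumptions made for the example, not details taken from the slide's figure.

```python
def list_schedule(deps, num_fus=2, latency=1):
    """Greedy list scheduling (illustrative sketch).

    deps maps each instruction to the list of its parents, in program order.
    Returns (cycle, fu, instruction) tuples. Assumes the graph is acyclic and
    every parent is itself a key of deps.
    """
    done_at = {}             # instruction -> cycle its result is available
    remaining = list(deps)   # program order doubles as the priority order
    schedule = []
    cycle = 0
    while remaining:
        # "When": an instruction is ready once all parents have produced results.
        ready = [i for i in remaining
                 if all(done_at.get(p, float("inf")) <= cycle for p in deps[i])]
        # "Where": hand the oldest ready instructions to the available FUs.
        for fu, ins in enumerate(ready[:num_fus]):
            done_at[ins] = cycle + latency
            schedule.append((cycle, fu, ins))
            remaining.remove(ins)
        cycle += 1
    return schedule


# Hypothetical five-instruction graph: C needs A; D needs A and B; E needs C and D.
example = {"A": [], "B": [], "C": ["A"], "D": ["A", "B"], "E": ["C", "D"]}
for cycle, fu, ins in list_schedule(example):
    print(f"cycle {cycle}: FU{fu} executes {ins}")
```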

6
Instruction scheduling
  • A set of wakeup and select operations
  • Wakeup
  • Broadcasts the tags of the selected parent
    instructions
  • Dependent instructions match the tags and
    determine whether their source operands are ready
  • Resolves true data dependences
  • Select
  • Picks instructions to issue among a pool of ready
    instructions
  • Resolves resource conflicts
  • Issue bandwidth
  • Limited number of functional units / memory ports
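
The wakeup and select steps above can be sketched as a small cycle-by-cycle loop. This is a simplified model under stated assumptions: single-cycle latencies, oldest-first selection, and hypothetical names (`Entry`, `run_scheduler`); real select logic and tag formats differ.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    dest: str      # tag broadcast when this instruction is selected
    waiting: set   # source tags that have not been broadcast yet


def run_scheduler(entries, issue_width):
    """Return, per cycle, the names of the instructions that were selected."""
    window = list(entries)   # kept in program order (oldest first)
    trace = []
    while window:
        # Wakeup outcome: an entry requests issue once no source tag is pending.
        ready = [e for e in window if not e.waiting]
        if not ready:
            raise RuntimeError("deadlock: a source tag is never produced")
        # Select: resolve the resource conflict by granting the oldest requests.
        granted = ready[:issue_width]
        trace.append([e.name for e in granted])
        for e in granted:
            window.remove(e)
        # Broadcast: tags of the selected parents wake up their dependents.
        tags = {e.dest for e in granted}
        for e in window:
            e.waiting -= tags
    return trace
```

With an issue width of 2, three simultaneously ready instructions cannot all issue in one cycle, which is the resource conflict that select resolves.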

7
Scheduling loop
  • Basic wakeup and select operations

[Figure: scheduling loop. Wakeup logic converts ready signals into requests 0..n; select logic returns grants 0..n, issues the selected instruction to an FU, and broadcasts its tag back to the wakeup logic]
8
Wakeup and Select
9
Scheduling Atomicity
  • Operations in the scheduling loop must occur
    within a single clock cycle
  • For back-to-back execution of dependent
    instructions

10
Implication of scheduling atomicity
  • Pipelining is a standard way to improve clock
    frequency
  • Hard to pipeline instruction scheduling logic
    without losing ILP
  • 10% IPC loss in 2-cycle scheduling
  • 19% IPC loss in 3-cycle scheduling
  • A major obstacle to building high-frequency
    microprocessors
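
A small worked example of why a pipelined scheduling loop costs ILP: if wakeup and select take more than one cycle, a dependent instruction cannot issue in the cycle right after its parent. The sketch assumes unlimited issue width and single-cycle execution; the function name and the example chain are hypothetical.

```python
def earliest_issue(deps, loop_cycles, exec_latency=1):
    """Earliest issue cycle per instruction with unlimited issue width.

    deps maps each instruction to its parents; parents must appear first.
    A child cannot issue sooner than loop_cycles after its parent, because
    the parent's tag needs that long to go through wakeup and select.
    """
    gap = max(loop_cycles, exec_latency)
    issue = {}
    for ins, parents in deps.items():
        issue[ins] = max((issue[p] + gap for p in parents), default=0)
    return issue


# Hypothetical chain a -> b -> c -> d plus one independent instruction e.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": []}
print(earliest_issue(chain, loop_cycles=1))  # atomic loop: d issues in cycle 3
print(earliest_issue(chain, loop_cycles=2))  # 2-cycle loop: d issues in cycle 6
```

Only the dependent chain is stretched; the independent instruction still issues in cycle 0, so the loss depends on how much of the program is serial.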

11
Scheduler Designs
  • Data-Capture Scheduler
  • keep the most recent register value in
    reservation stations
  • Data forwarding and wakeup are combined

12
Scheduler Designs
  • Non-Data-Capture Scheduler
  • keep the most recent register value in RF
    (physical registers)
  • Data forwarding and wakeup are decoupled
  • Complexity benefits
  • simpler scheduler / data / wakeup path

13
Mapping to pipeline stages
  • AMD K7 (data-capture)
[Figure: K7 pipeline stages; labels: Data, Data / wakeup]
  • Pentium 4 (non-data-capture)
[Figure: Pentium 4 pipeline stages; labels: wakeup, Data]
14
Scheduling atomicity and the non-data-capture scheduler
  • Multi-cycle scheduling loop
  • Scheduling atomicity is not maintained
  • Separated by extra pipeline stages (Disp, RF)
  • Unable to issue dependent instructions
    consecutively
  • → Solution: speculative scheduling

15
Speculative Scheduling
  • Speculatively wake up dependent instructions even
    before the parent instruction starts execution
  • Keep the scheduling loop within a single clock
    cycle
  • But nobody knows what will happen in the future
  • Source of uncertainty in instruction scheduling:
    loads
  • Cache hit / miss
  • Store-to-load aliasing
  • → eventually affects timing decisions
  • Scheduler assumes that all types of instructions
    have pre-determined fixed latencies
  • Load instructions are assumed to have the common-case
    (over 90% in general) DL1 hit latency
  • If incorrect, subsequent (dependent) instructions
    are replayed

16
Speculative Scheduling
  • Overview
  • Unlike the original Tomasulo's algorithm
  • Instructions are scheduled BEFORE actual
    execution occurs
  • Assumes instructions have pre-determined fixed
    latencies
  • ALU operations: fixed latency
  • Load operations: assume DL1 hit latency (common
    case)
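
The sketch below illustrates scheduling with pre-determined latencies: each dependent is issued at its parent's issue cycle plus the assumed latency, and anything issued before its data actually arrives is flagged as mis-scheduled. The latency values (1 cycle for ALU operations, 3 for a DL1 hit) and all names are hypothetical choices for the example.

```python
# Assumed latencies used by the scheduler; the numbers are illustrative only.
ASSUMED_LATENCY = {"alu": 1, "load": 3}


def speculative_schedule(program, actual_latency):
    """program: list of (name, kind, parents) in program order.

    actual_latency overrides the assumed latency for specific instructions,
    for example a load that misses in the DL1. Returns the speculative issue
    cycle of every instruction and the set that was issued too early.
    """
    issue, kind_of, mis_scheduled = {}, {}, set()
    for name, kind, parents in program:
        kind_of[name] = kind
        # Scheduled BEFORE execution: wake up using the assumed parent latency.
        issue[name] = max((issue[p] + ASSUMED_LATENCY[kind_of[p]] for p in parents),
                          default=0)
        for p in parents:
            real = actual_latency.get(p, ASSUMED_LATENCY[kind_of[p]])
            # Too early if the parent's data is late or the parent itself is invalid.
            if issue[p] + real > issue[name] or p in mis_scheduled:
                mis_scheduled.add(name)
    return issue, mis_scheduled


prog = [("ld", "load", []), ("add", "alu", ["ld"]),
        ("sub", "alu", ["add"]), ("or", "alu", [])]
print(speculative_schedule(prog, {}))          # DL1 hit: nothing mis-scheduled
print(speculative_schedule(prog, {"ld": 20}))  # miss: add and sub need a replay
```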

17
Scheduling replay
  • Speculation needs verification / recovery
  • There's no free lunch
  • If the actual load latency is longer (i.e. cache
    miss) than what was speculated
  • Best solution (disregarding complexity): replay
    data-dependent instructions issued under the load
    shadow

[Figure: timeline marking the point where the cache miss is detected]
18
Wavefront propagation
  • Speculative execution wavefront
  • speculative image of execution (from the
    scheduler's perspective)
  • Both wavefronts propagate along dependence edges
    at the same rate (1 level / cycle)
  • the real wavefront runs behind the speculative
    wavefront
  • The load resolution loop delay complicates the
    recovery process
  • a scheduling miss is signaled a couple of clock
    cycles after issue

19
Load resolution feedback delay in instruction scheduling
[Figure: load resolution loop; stage labels: Broadcast / wakeup, Select, Dispatch / Payload, RF, Misc., Execution, tagged with N through N-4; annotations: "Time delay between sched and feedback", "recover instructions in this path"]
  • Scheduling runs multiple clock cycles ahead of
    execution
  • But, instructions can keep track of only one
    level of dependence at a time (using source
    operand identifiers)

20
Issues in scheduling replay
[Figure: Sched / Issue, Exe, and checker stages over cycles n, n+1, n+2, n+3, with the cache miss signal raised by the checker]
  • Cannot stop speculative wavefront propagation
  • Both wavefronts propagate at the same rate
  • Dependent instructions are unnecessarily issued
    under load misses

21
Requirements of scheduling replay
  • Propagation of recovery status should be faster
    than speculative wavefront propagation
  • Recovery should be performed on the transitive
    closure of dependent instructions
  • Conditions for ideal scheduling replay
  • All mis-scheduled dependent instructions are
    invalidated instantly
  • Independent instructions are unaffected
  • Multiple levels of dependence tracking are needed
  • e.g. Am I dependent on the current cache miss?
  • Longer load resolution loop delay → tracking more
    levels

22
Scheduling replay schemes
  • Alpha 21264: non-selective replay
  • Replays all dependent and independent
    instructions issued under the load shadow
  • Analogous to squashing recovery in branch
    misprediction
  • Simple but high performance penalty
  • Independent instructions are unnecessarily
    replayed
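
To make the cost difference concrete, the sketch below compares the two replay sets for a handful of instructions issued in a load shadow: non-selective replay takes everything, while ideal selective replay takes only the transitive closure of the load's dependents. The instruction names and the helper `replay_sets` are hypothetical.

```python
def replay_sets(issued, miss_load):
    """issued: (name, parents) pairs issued under the load shadow, in issue order.

    Returns (selective, non_selective) lists of instructions to replay.
    """
    tainted = {miss_load}
    for name, parents in issued:
        # Transitive closure: depending on any tainted instruction taints you.
        if tainted.intersection(parents):
            tainted.add(name)
    selective = [name for name, _ in issued if name in tainted]
    non_selective = [name for name, _ in issued]   # squash-style: replay all
    return selective, non_selective


shadow = [("a", ["ld"]), ("x", []), ("b", ["a", "x"]), ("y", ["x"])]
selective, non_selective = replay_sets(shadow, "ld")
print(selective)       # only the dependents of the missing load
print(non_selective)   # every instruction issued in the shadow
```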

23
Position-based selective replay
  • Ideal selective recovery
  • replay dependent instructions only
  • Dependence tracking is managed in a matrix form
  • Column: load issue slot; row: pipeline stage
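
One way to picture the dependence matrix is as one shift register per load issue slot. The sketch below is a simplified model, not the exact hardware organization: a load sets bit 0 in its slot's column, dependents OR in their parents' matrices at issue, and every matrix shifts by one row per cycle. The dimensions and class name are assumptions.

```python
SLOTS = 2   # load issue slots (matrix columns); hypothetical width
DEPTH = 4   # pipeline stages tracked (matrix rows); hypothetical depth


class DepMatrix:
    """Per-instruction dependence matrix, one DEPTH-bit column per load slot."""

    def __init__(self):
        self.cols = [0] * SLOTS

    def mark_load(self, slot):
        # A load issued in `slot` starts a new dependence at row 0.
        self.cols[slot] |= 1

    def inherit(self, parent):
        # A dependent merges (ORs) the matrix of each parent when it issues.
        self.cols = [a | b for a, b in zip(self.cols, parent.cols)]

    def tick(self):
        # Each cycle, every tracked load moves one pipeline stage further.
        mask = (1 << DEPTH) - 1
        self.cols = [(c << 1) & mask for c in self.cols]

    def depends_on(self, slot, age):
        # True if this instruction depends on the load issued in `slot`
        # `age` cycles ago, directly or transitively.
        return bool((self.cols[slot] >> age) & 1)
```

When a miss is detected for the load that issued in slot `s` some `k` cycles earlier, every instruction whose matrix has bit `(s, k)` set can be invalidated in the same cycle, without walking the dependence chain.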

24
Low-complexity scheduling techniques
  • FIFO (Palacharla, Jouppi, Smith, 1996)
  • Replaces conventional scheduling logic with
    multiple FIFOs
  • Steering logic puts instructions into different
    FIFOs considering dependences
  • A FIFO contains a chain of dependent instructions
  • Only the head instructions are considered for
    issue
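
A minimal sketch of dependence-based steering, assuming a simplified heuristic: an instruction follows a producer that sits at the tail of a FIFO, otherwise it takes an empty FIFO, otherwise dispatch stalls. Issue and draining of the FIFO heads are not modeled, and the function name is hypothetical.

```python
from collections import deque


def steer(instrs, num_fifos):
    """instrs: (dest, sources) pairs in program order. Returns the FIFOs."""
    fifos = [deque() for _ in range(num_fifos)]
    for dest, srcs in instrs:
        # Prefer a FIFO whose tail produces one of our sources, so that each
        # FIFO holds a chain of dependent instructions.
        target = next((f for f in fifos if f and f[-1][0] in srcs), None)
        if target is None:
            # Otherwise start a new chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            raise RuntimeError("no suitable FIFO: dispatch would stall")
        target.append((dest, srcs))
    return fifos


fifos = steer([("r1", ()), ("r2", ("r1",)), ("r3", ()), ("r4", ("r3", "r2"))], 2)
for i, fifo in enumerate(fifos):
    print(f"FIFO {i}:", [dest for dest, _ in fifo])
```

Because only FIFO heads are candidates for issue, the select logic examines `num_fifos` instructions instead of the whole window.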

25
FIFO (cont'd)
  • Scheduling example

26
FIFO (cont'd)
  • Performance
  • Comparable performance to conventional
    scheduling
  • Reduced scheduling logic complexity
  • Many related papers on clustered microarchitecture

27
CRIB Reading
  • Erika Gunadi, Mikko Lipasti, "CRIB: Combined
    Rename, Issue, and Bypass," ISCA 2011.
  • Goals
  • Match OOO performance per cycle
  • Match OOO frequency
  • Match OOO area
  • Reduce power significantly
  • Eliminate pipelines, latches, rename structures,
    issue logic

28
CRIB Data Movement
[Figure: data movement in a physical-register-file-style design versus in-place execution; labels: Front-End, ROB, RS, PRF, ARF, Bypass, ALU, CRIB]
29
In-place Execution
  • First proposed in Ultrascalar (1999)
  • Place instructions in execution stations
  • Route operands to instructions
  • Goal: massively wide issue
  • Power constraints not even on the horizon
  • CRIB: in-place execution as enabler
  • Eliminate pipelined execution lanes, multiported
    RF, renaming, wakeup/select, clock loads
  • Enable efficient speculation recovery
  • Enable variable execution latency tolerance

30
CRIB Concept
[Figure: one CRIB entry between the Previous Entry and the Next Entry; labels: Source1, Source2, Destination, ALU, WE, register columns R0-R3, each with a completion bit (C)]
  • Data values propagate combinationally (no
    latches)
  • Completion bit propagates synchronously (latched)
  • Instructions stay until all are finished
  • When all are finished, latch data into ARF latches

31
Renaming in CRIB
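[Figure: four CRIB entries with Source1 / Source2 / Destination fields over register columns R0-R3, each with a completion bit (C); entries are annotated with Cyc 1, Cyc 2, Cyc 3]
Example instruction sequence:
Add R2, R0, R0
Sub R3, R0, R2
Add R2, R2, R3
Add R2, R0, R3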
  • All the connections form in parallel after
    dispatch
  • Dependences are resolved by positional renaming
  • Instructions issue subject to the readiness of
    their operands
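
Positional renaming can be illustrated on the slide's four-instruction sequence: each source is wired to the nearest older entry that writes that register, or to the incoming architected value if there is none. The sketch below only computes those links; the tuple encoding and the `"ARF"` marker are assumptions for the example.

```python
def link_sources(program):
    """program: (op, dest, src1, src2) tuples in CRIB entry order.

    Returns, per entry, where each source value comes from: the index of the
    nearest older entry that writes the register, or "ARF" if no older entry
    in this CRIB writes it.
    """
    last_writer = {}
    links = []
    for index, (_op, dest, src1, src2) in enumerate(program):
        links.append(tuple(last_writer.get(s, "ARF") for s in (src1, src2)))
        # A younger reader of `dest` now sees this entry's output instead.
        last_writer[dest] = index
    return links


program = [
    ("add", "R2", "R0", "R0"),
    ("sub", "R3", "R0", "R2"),
    ("add", "R2", "R2", "R3"),
    ("add", "R2", "R0", "R3"),
]
for entry, sources in enumerate(link_sources(program)):
    print(f"entry {entry}: sources from {sources}")
```

No rename table is consulted: the position of an entry in the register column determines which value it sees, so the two writes to R2 never conflict.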

32
Scaling Up CRIB
[Figure: scaled-up CRIB organization; labels: Front End, four ARFs, four LQ Banks, four SQs, Mult/Div, Cache Port]
  • Multiple CRIB partitions maintained as circular
    queue
  • Only head ARF has committed state
  • Other latches are left transparent

33
Data Propagation across partitions
  • Transparent latches for data
  • Regular latches for complete bit
  • Data values take one additional cycle to travel
    to the next partition

[Figure: register columns R0-R3 with complete bits (C) across three partitions, annotated Cycle 0 through Cycle 3]
34
CRIB Pipeline Diagram
[Figure: OoO versus CRIB pipeline stages; stage labels: Rnm, Disp, Issue, RF, WB, Cmt, Int, A-Gen, Load; annotations: "dependence linking", "data linking", "dependence / data linking"]
  • Fewer pipe stages
  • Remove rename stage from front-end
  • Remove issue and RF from middle
  • Combine dependence and data linking

35
Load-Store Ordering
[Figure: CRIB entries holding ADD R2, R3, R1; ADD R2, R1, R1; LD R3, R1, R2; ST R0, R1, 1 over register columns R0-R3, with LQ and SQ entries (Addr, Data)]
  • Loads/stores are ordered aggressively
  • Recovery: replay in place
  • No prediction needed; recovery is cheap

36
Branch Misprediction
[Figure: CRIB entries over register columns R0-R3; a branch mispredict above Instruction 0 turns Instruction 2 and Instruction 3 into NOPs]
  • Mispredicted branch drives a global signal up the
    CRIB
  • Forces younger instructions to transform into
    NOPs
  • Simpler than checkpointing or ROB unrolling

37
CRIB Findings
  • CRIB proposal appears promising
  • Competitive IPC and area
  • Dramatic power reductions
  • Over baseline 1 (Bobcat)
  • 45% less energy per instruction
  • 20-30% better IPC
  • Over baseline 2 (Nehalem)
  • 75% less energy per instruction
  • INT IPC slightly better, FP IPC slightly worse

38
CRIB Summary
  • Instructions are inserted from front end
  • Instructions inside CRIB execute subject to
    readiness of operands
  • Data propagates without latches
  • Complete bit ensures that data propagate
    synchronously
  • A CRIB retires when all instructions are done
    executing
  • When a CRIB retires, data are latched in the ARF

39
Memory Dataflow
40
Scalable Load/Store Queues
  • Load queue/store queue
  • Large instruction window: many loads and stores
    have to be buffered (25%/15% of mix)
  • Expensive searches
  • positional-associative searches in SQ
  • associative lookups in LQ
  • coherence, speculative load scheduling
  • Power/area/delay are prohibitive

41
Store Queue/Load Queue Scaling
  • Multilevel queues
  • Bloom filters (quick check for independence)
  • Eliminate associative load queue via replay
  • Cain 2004
  • Issue loads again at commit, in order
  • Check to see if same value is returned
  • Filter load checks for efficiency
  • Most loads don't issue out of order (no
    speculation)
  • Most loads don't coincide with coherence traffic
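
The replay-at-commit idea can be sketched as follows: each load carries the value it obtained when it executed early, and at commit it is re-executed in program order and the two values are compared. This is a minimal model with a dictionary standing in for the cache; it records violations instead of squashing, and the names are hypothetical.

```python
def commit_in_order(ops, memory):
    """ops: program-order list of ("st", addr, value) or ("ld", addr, early_value).

    Stores update memory at commit. Each load is issued again at commit and
    its value compared with the one it obtained when it executed early.
    Returns the indices of loads whose early value was wrong.
    """
    violations = []
    for index, (kind, addr, value) in enumerate(ops):
        if kind == "st":
            memory[addr] = value
        elif memory.get(addr, 0) != value:
            # Same address, different value: the early load must be squashed.
            violations.append(index)
    return violations


memory = {0x10: 1}
ops = [
    ("st", 0x10, 5),   # older store
    ("ld", 0x10, 1),   # executed before the store, so it read the stale 1
    ("ld", 0x20, 0),   # unrelated load, value still correct
]
print(commit_in_order(ops, memory))
```

No associative search of a load queue is needed; the check is a second cache access, which is why filtering out loads that cannot have been affected matters for efficiency.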

42
SVW and NoSQ
  • Store Vulnerability Window (SVW)
  • Assign sequence numbers to stores
  • Track writes to cache with sequence numbers
  • Efficiently filter out safe loads/stores by only
    checking against writes in vulnerability window
  • NoSQ
  • Rely on load/store alias prediction to satisfy
    dependent pairs
  • Use SVW technique to check
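
A rough sketch of the SVW filtering idea, under simplifying assumptions: a small untagged table indexed by address keeps the sequence number of the last committed store, and each load remembers the youngest store sequence number it is known to be safe against. The table size, the index function, and the names are hypothetical rather than taken from the SVW or NoSQ papers.

```python
class SVWFilter:
    """Untagged table: hashed address -> sequence number of last committed store."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [0] * entries

    def _index(self, addr):
        # Hypothetical index function: 8-byte granularity, modulo table size.
        return (addr >> 3) % self.entries

    def commit_store(self, addr, ssn):
        # Track writes to the cache with store sequence numbers.
        self.table[self._index(addr)] = ssn

    def must_reexecute(self, addr, safe_ssn):
        # A load only needs checking if a store younger than `safe_ssn`
        # (inside its vulnerability window) wrote a matching entry.
        return self.table[self._index(addr)] > safe_ssn


svw = SVWFilter()
svw.commit_store(0x100, ssn=1)
print(svw.must_reexecute(0x100, safe_ssn=1))  # store already seen: filtered out
svw.commit_store(0x100, ssn=2)
print(svw.must_reexecute(0x100, safe_ssn=1))  # store inside the window: check it
```

Because the table is untagged, aliasing can only cause extra re-executions, never a missed one, so the filter stays conservative.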

43
Store/Load Optimizations
  • Weakness: predictor still fails
  • Machine should fail gracefully, not fall off a
    cliff
  • Glass jaw
  • Several other concurrent proposals
  • DMDC, Fire-and-forget,

44
Key Challenge: MLP
  • Tolerate/overlap memory latency
  • Once first miss is encountered, find another one
  • Naïve solution
  • Implement a very large ROB, IQ, LSQ
  • Power/area/delay make this infeasible
  • Build virtual instruction window

45
Runahead
  • Use poison bits to eliminate miss-dependent load
    program slice
  • Forward load slice processing is a very old idea
  • Massive Memory Machine (Garcia-Molina et al. '84)
  • Datascalar (Burger, Kaxiras, Goodman '97)
  • Runahead proposed by Dundas, Mudge '97
  • Checkpoint state, keep running
  • When miss completes, return to checkpoint
  • May need runahead cache for store/load
    communication
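
Poison-bit propagation during runahead can be sketched as a walk over the instruction stream after the miss: anything reading a poisoned register is skipped and poisons its own destination, while everything else executes (and can, for example, start further misses). The register names and the helper `runahead` are hypothetical; checkpointing and the runahead cache are not modeled.

```python
def runahead(instrs, miss_dest):
    """instrs: (name, dest, sources) tuples following the missing load.

    miss_dest is the destination register of the load that missed.
    Returns (executed, skipped) instruction names.
    """
    poisoned = {miss_dest}
    executed, skipped = [], []
    for name, dest, srcs in instrs:
        if any(s in poisoned for s in srcs):
            # Part of the miss-dependent slice: do not execute, spread poison.
            poisoned.add(dest)
            skipped.append(name)
        else:
            # Independent of the miss: executes and produces a valid value.
            poisoned.discard(dest)
            executed.append(name)
    return executed, skipped


stream = [
    ("add", "r2", ("r1",)),       # reads the missing load's result
    ("ld2", "r3", ("r4",)),       # independent load, may overlap its own miss
    ("mul", "r5", ("r2", "r3")),  # poisoned through r2
    ("mov", "r2", ("r4",)),       # overwrites r2 with a valid value
    ("sub", "r6", ("r2",)),       # r2 is clean again
]
print(runahead(stream, miss_dest="r1"))
```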

46
Waiting Instruction Buffer
  • Lebeck et al. ISCA 2002
  • Capture forward load slice in separate buffer
  • Propagate poison bits to identify slice
  • Relieve pressure on issue queue
  • Reinsert instructions when load completes
  • Very similar to Intel Pentium 4 replay mechanism
  • But not publicly known at the time

47
Continual Flow Pipelines
  • Srinivasan et al. 2004
  • Slice buffer: extension of WIB
  • Store operands in slice buffer as well to free up
    buffer entries in the OOO window
  • Relieve pressure on rename/physical registers
  • Applicable to
  • data-capture machines (Intel P6) or
  • physical register file machines (Pentium 4)
  • Also extended to in-order machines (iCFP)
  • Challenge: how to buffer loads/stores
  • Reading: Hilton & Roth, BOLT, HPCA 2010

48
Instruction Flow
49
Transparent Control Independence
  • Control flow graph convergence
  • Execution reconverges after branches
  • If-then-else constructs, etc.
  • Can we fetch/execute instructions beyond
    convergence point?
  • How do we resolve ambiguous register and memory
    dependences?
  • Writes may or may not occur in branch shadow
  • TCI employs CFP-like slice buffer to solve these
    problems
  • Instructions with ambiguous dependences are buffered
  • Reinsert them the same way a forward load miss
    slice is reinserted
  • Best CI proposal to date, but still very
    complex and expensive, with moderate payback

50
Summary of Advanced Microarchitecture
  • Instruction scheduling overview
  • Scheduling atomicity
  • Speculative scheduling
  • Scheduling recovery
  • Complexity-effective instruction scheduling
    techniques
  • CRIB reading
  • Scalable load/store handling
  • NoSQ reading
  • Building large instruction windows
  • Runahead, CFP, iCFP
  • Control Independence