1
Advanced Register Dataflow
Prof. Mikko H. Lipasti, University of Wisconsin-Madison
Lecture notes based on notes by Ilhyun Kim
Updated by Mikko Lipasti
2
Outline
  • Instruction scheduling overview
  • Scheduling atomicity
  • Speculative scheduling
  • Scheduling recovery
  • Complexity-effective instruction scheduling
    techniques

3
Readings
  • Read on your own:
  • Shen & Lipasti, Chapter 10 on Advanced Register
    Data Flow (skim)
  • I. Kim and M. Lipasti, "Understanding Scheduling
    Replay Schemes," in Proceedings of the 10th
    International Symposium on High-Performance
    Computer Architecture (HPCA-10), February 2004.
  • To be discussed in class:
  • Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary,
    Amit Gandhi, and Mike Upton, "Continual Flow
    Pipelines," in Proceedings of ASPLOS 2004,
    October 2004.
  • T. Sha, M. Martin, A. Roth, "NoSQ: Store-Load
    Communication without a Store Queue," in
    Proceedings of the 39th Annual IEEE/ACM
    International Symposium on Microarchitecture,
    2006.
  • Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric
    Rotenberg, Haitham H. Akkary, "Transparent
    Control Independence," in Proceedings of ISCA-34,
    2007.

4
We are talking about
  • Register data flow

5
Instruction scheduling
  • A process of mapping a series of instructions
    onto execution resources
  • Decides when and where an instruction is executed
  • Example (figure): a data dependence graph mapped
    to two FUs

6
Instruction scheduling
  • A set of wakeup and select operations (sketched
    in code below)
  • Wakeup
  • Broadcasts the tags of the selected parent
    instructions
  • A dependent instruction compares the broadcast
    tags against its source tags to determine whether
    its source operands are ready
  • Resolves true data dependences
  • Select
  • Picks instructions to issue from the pool of
    ready instructions
  • Resolves resource conflicts
  • Issue bandwidth is limited by the number of
    functional units / memory ports
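
As a rough illustration of this loop (a minimal software sketch with
hypothetical structures, not taken from the slides), one scheduling cycle can
be modeled as a select over ready entries followed by a tag broadcast that
wakes their dependents:

    # Minimal model of one atomic wakeup/select cycle (hypothetical structures).
    from dataclasses import dataclass

    @dataclass
    class Entry:
        dest_tag: int        # tag broadcast when this instruction issues
        src_tags: set        # tags of source operands still outstanding
        ready: bool = False

    def wakeup(queue, broadcast_tags):
        # Dependents drop matching tags; with no tags left, the entry is ready.
        for e in queue:
            e.src_tags -= broadcast_tags
            if not e.src_tags:
                e.ready = True

    def select(queue, issue_width):
        # Pick up to issue_width ready entries (resolves resource conflicts).
        return [e for e in queue if e.ready][:issue_width]

    def schedule_cycle(queue, issue_width):
        issued = select(queue, issue_width)
        for e in issued:
            queue.remove(e)
        # Atomicity: the selected tags wake dependents within the same cycle.
        wakeup(queue, {e.dest_tag for e in issued})
        return issued

    q = [Entry(dest_tag=1, src_tags=set()),     # no outstanding sources
         Entry(dest_tag=2, src_tags={1})]       # waits for tag 1
    wakeup(q, set())                            # mark initially-ready entries
    print([e.dest_tag for e in schedule_cycle(q, issue_width=2)])   # [1]
    print([e.dest_tag for e in schedule_cycle(q, issue_width=2)])   # [2]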

7
Scheduling loop
  • Basic wakeup and select operations

[Figure: the scheduling loop. Wakeup logic broadcasts the tag of the selected
instruction; entries whose operands become ready raise request lines to the
select logic, which returns grant signals and issues the selected instructions
to the FUs.]
8
Wakeup and Select
9
Scheduling Atomicity
  • Operations in the scheduling loop must occur
    within a single clock cycle
  • For back-to-back execution of dependent
    instructions

10
Implication of scheduling atomicity
  • Pipelining is a standard way to improve clock
    frequency
  • Hard to pipeline instruction scheduling logic
    without losing ILP
  • 10% IPC loss with 2-cycle scheduling
  • 19% IPC loss with 3-cycle scheduling
  • A major obstacle to building high-frequency
    microprocessors
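
The IPC cost has a simple intuition (a back-of-the-envelope sketch of my own,
not the slides' simulation data): if the wakeup/select loop takes n cycles, a
chain of dependent single-cycle instructions can issue at best every n cycles,
so the chain stretches out roughly by a factor of n:

    # Earliest issue cycle of each instruction in a chain of dependent 1-cycle
    # ops, when the only limit is an n-cycle scheduling (wakeup/select) loop.
    def chain_issue_cycles(chain_length, sched_loop_cycles):
        return [i * sched_loop_cycles for i in range(chain_length)]

    print(chain_issue_cycles(5, 1))  # [0, 1, 2, 3, 4]   back-to-back issue
    print(chain_issue_cycles(5, 2))  # [0, 2, 4, 6, 8]   one bubble per dependence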

11
Scheduler Designs
  • Data-Capture Scheduler
  • keep the most recent register value in
    reservation stations
  • Data forwarding and wakeup are combined

12
Scheduler Designs
  • Non-Data-Capture Scheduler
  • keep the most recent register value in RF
    (physical registers)
  • Data forwarding and wakeup are decoupled
  • Complexity benefits
  • simpler scheduler / data / wakeup path

13
Mapping to pipeline stages
  • AMD K7 (data-capture): data forwarding and
    wakeup are combined in the same pipeline stage
  • Pentium 4 (non-data-capture): wakeup and data
    delivery occur in separate pipeline stages
14
Scheduling atomicity and the non-data-capture scheduler
  • Multi-cycle scheduling loop
  • Scheduling atomicity is not maintained
  • Wakeup and execution are separated by extra
    pipeline stages (Disp, RF)
  • Unable to issue dependent instructions
    consecutively
  • → Solution: speculative scheduling

15
Speculative Scheduling
  • Speculatively wake up dependent instructions even
    before the parent instruction starts execution
  • This keeps the scheduling loop within a single
    clock cycle
  • But nobody knows what will happen in the future
  • Source of uncertainty in instruction scheduling:
    loads
  • Cache hit / miss
  • Store-to-load aliasing
  • → eventually affects timing decisions
  • Scheduler assumes that all types of instructions
    have pre-determined, fixed latencies
  • Load instructions are assumed to have the common-
    case (over 90% in general) DL1 hit latency
  • If incorrect, subsequent (dependent) instructions
    are replayed
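
A small sketch of the timing involved (my own simplified model with made-up
names and latencies): dependents are woken at the assumed DL1 hit latency, and
the checker only learns the real outcome a few cycles later, which determines
whether they must replay:

    # Speculative scheduling timing sketch (assumed, illustrative latencies).
    ASSUMED_HIT_LATENCY = 3   # dependents are woken this many cycles after issue
    MISS_RESOLVE_DELAY  = 6   # actual hit/miss outcome is known only this late

    def schedule_load(issue_cycle, actually_hits):
        wakeup_cycle = issue_cycle + ASSUMED_HIT_LATENCY   # speculative wakeup
        verify_cycle = issue_cycle + MISS_RESOLVE_DELAY    # checker result arrives
        replay_needed = not actually_hits                  # dependents issued in
        return wakeup_cycle, verify_cycle, replay_needed   # the shadow must replay

    print(schedule_load(10, actually_hits=True))   # (13, 16, False)
    print(schedule_load(10, actually_hits=False))  # (13, 16, True) -> replay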

16
Speculative Scheduling
  • Overview
  • Unlike the original Tomasulo's algorithm,
    instructions are scheduled BEFORE actual
    execution occurs
  • Assumes instructions have pre-determined, fixed
    latencies
  • ALU operations: fixed latency
  • Load operations: assume the DL1 hit latency
    (common case)

17
Scheduling replay
  • Speculation needs verification / recovery
  • There's no free lunch
  • If the actual load latency is longer than what
    was speculated (i.e., a cache miss)
  • Best solution (disregarding complexity): replay
    data-dependent instructions issued under the load
    shadow

[Figure: the cache miss is detected only after dependent instructions have
already issued under the load shadow]
18
Wavefront propagation
  • Speculative execution wavefront
  • The speculative image of execution (from the
    scheduler's perspective)
  • Both wavefronts propagate along dependence edges
    at the same rate (1 level / cycle)
  • The real wavefront runs behind the speculative
    wavefront
  • The load resolution loop delay complicates the
    recovery process
  • A scheduling miss is signaled a couple of clock
    cycles after issue

19
Load resolution feedback delay in instruction
scheduling
[Figure: the load resolution loop. Broadcast/wakeup and select for cycle N run
several stages ahead of dispatch/payload, RF read, execution, and the miss
check (spread over cycles N-1 through N-4); the time delay between scheduling
and feedback means all instructions issued along this path must be recovered.]
  • Scheduling runs multiple clock cycles ahead of
    execution
  • But, instructions can keep track of only one
    level of dependence at a time (using source
    operand identifiers)

20
Issues in scheduling replay
[Figure: timeline from cycle n to n+3. Instructions are scheduled/issued and
executed; the checker raises the cache miss signal only several cycles after
the load's dependents have been scheduled.]
  • Cannot stop speculative wavefront propagation
  • Both wavefronts propagate at the same rate
  • Dependent instructions are unnecessarily issued
    under load misses

21
Requirements of scheduling replay
  • Propagation of recovery status should be faster
    than speculative wavefront propagation
  • Recovery should be performed on the transitive
    closure of dependent instructions
  • Conditions for ideal scheduling replay
  • All mis-scheduled dependent instructions are
    invalidated instantly
  • Independent instructions are unaffected
  • Multiple levels of dependence tracking are needed
  • e.g., "Am I dependent on the current cache miss?"
  • Longer load resolution loop delay → tracking more
    levels
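
A software illustration of the transitive-closure requirement (my own sketch,
not a description of the hardware): starting from the missed load, every
reachable consumer is collected into the replay set, and anything unreachable
is left alone:

    # Transitive closure of instructions dependent on a missed load.
    # consumers maps an instruction id to the ids that read its result.
    def replay_set(missed_load, consumers):
        to_replay, worklist = set(), [missed_load]
        while worklist:
            inst = worklist.pop()
            for child in consumers.get(inst, ()):
                if child not in to_replay:       # visit each dependent once
                    to_replay.add(child)
                    worklist.append(child)
        return to_replay                         # independent insts untouched

    deps = {0: [1, 2], 1: [3], 4: [5]}           # 4 and 5 are independent of load 0
    print(sorted(replay_set(0, deps)))           # [1, 2, 3]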

22
Scheduling replay schemes
  • Alpha 21264: non-selective replay
  • Replays all dependent and independent
    instructions issued under the load shadow
  • Analogous to squashing recovery for branch
    misprediction
  • Simple, but high performance penalty
  • Independent instructions are unnecessarily
    replayed

23
Position-based selective replay
  • Ideal selective recovery
  • Replay dependent instructions only
  • Dependence tracking is managed in a matrix form
  • Column: load issue slot; row: pipeline stage
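
A rough software analogue of this idea (my own simplification that drops the
pipeline-stage rows and keeps only the load-slot columns): each instruction
carries a bit per load issue slot, ORs in its parents' bits as it wakes up,
and replays if the bit for the missing load's slot is set:

    # Simplified position-based dependence tracking: one bit per load issue slot.
    NUM_LOAD_SLOTS = 4

    def new_vector():
        return [0] * NUM_LOAD_SLOTS

    def wake_from_load(vec, load_slot):
        vec[load_slot] = 1                        # direct dependence on that load
        return vec

    def wake_from_parent(child_vec, parent_vec):
        # Inherit the parent's load dependences (transitive tracking).
        return [c | p for c, p in zip(child_vec, parent_vec)]

    def must_replay(vec, missed_slot):
        return vec[missed_slot] == 1

    v_child      = wake_from_load(new_vector(), load_slot=2)
    v_grandchild = wake_from_parent(new_vector(), v_child)
    print(must_replay(v_grandchild, 2))           # True  -> replay
    print(must_replay(v_grandchild, 0))           # False -> unaffected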

24
Low-complexity scheduling techniques
  • FIFO (Palacharla, Jouppi, Smith, 1996)
  • Replaces conventional scheduling logic with
    multiple FIFOs
  • Steering logic places instructions into different
    FIFOs based on their dependences
  • Each FIFO holds a chain of dependent instructions
  • Only the head instructions are considered for
    issue
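
A minimal sketch of the steering idea (a hypothetical heuristic, simpler than
the paper's): append an instruction to a FIFO whose tail produces one of its
sources, otherwise start an empty FIFO, and let only the FIFO heads compete
for issue:

    # Dependence-based FIFO steering sketch (simplified heuristic).
    def steer(fifos, inst_id, src_ids):
        # Prefer a FIFO whose tail produces one of this instruction's sources.
        for f in fifos:
            if f and f[-1] in src_ids:
                f.append(inst_id)
                return
        # Otherwise use an empty FIFO, falling back to the first one.
        for f in fifos:
            if not f:
                f.append(inst_id)
                return
        fifos[0].append(inst_id)

    def issue_heads(fifos, is_ready):
        # Only the head instruction of each FIFO is a candidate for issue.
        issued = []
        for f in fifos:
            if f and is_ready(f[0]):
                issued.append(f.pop(0))
        return issued

    fifos = [[], [], []]
    steer(fifos, 1, set())       # independent
    steer(fifos, 2, {1})         # depends on 1 -> joins 1's FIFO
    steer(fifos, 3, set())       # independent -> new FIFO
    print(fifos)                                          # [[1, 2], [3], []]
    print(issue_heads(fifos, is_ready=lambda i: True))    # [1, 3]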

25
FIFO (cont'd)
  • Scheduling example

26
FIFO (cont'd)
  • Performance
  • Comparable performance to conventional scheduling
  • Reduced scheduling logic complexity
  • Many related papers on clustered
    microarchitectures

27
Sequential wakeup (Kim, Lipasti, 2003)
  • Not all instructions have two pending source
    operands
  • Only 4-16% of instructions have two source
    operands awakened before being issued
  • Such instructions usually have slack between the
    two wakeups
  • Reduce the load capacitance (and potentially the
    length) of the wakeup bus by decoupling half of
    the tag comparators

[Figure: conventional wakeup vs. sequential wakeup comparator organization]
28
Sequential wakeup (cont'd)
  • Scheduling example
  • Performance of sequential wakeup
  • Average 0.4%, worst case 1% IPC degradation
  • Up to 25% frequency benefit in the scheduling
    logic

29
Overcoming scheduling atomicity
  • Source of scheduling atomicity: the minimal
    execution latency of an instruction
  • Many ALU operations have single-cycle latency
  • Scheduling should keep up with execution
  • 1-cycle instructions need 1-cycle scheduling
  • Multi-cycle operations do not need atomic
    scheduling
  • → Relax the constraint by increasing the size of
    the scheduling unit
  • Combine multiple instructions into a multi-cycle-
    latency unit
  • Pipelined scheduling logic can then issue the
    original instructions in consecutive clock cycles
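
As an illustration of the grouping step (my own sketch, not the thesis's
actual MOP detection logic), pairs of dependent single-cycle instructions can
be fused into one two-cycle scheduling unit:

    # Sketch: pair each single-cycle instruction with a dependent single-cycle
    # consumer to form a 2-cycle macro-op (MOP) scheduling unit.
    def form_mops(insts):
        """insts: list of (inst_id, src_ids, latency). Returns MOP tuples."""
        grouped, mops = set(), []
        for head_id, _, head_lat in insts:
            if head_id in grouped or head_lat != 1:
                continue
            tail = next((i for i in insts
                         if i[0] not in grouped and i[0] != head_id
                         and head_id in i[1] and i[2] == 1), None)
            if tail:
                grouped.update({head_id, tail[0]})
                mops.append((head_id, tail[0]))   # scheduled as one 2-cycle unit
            else:
                grouped.add(head_id)
                mops.append((head_id,))           # left as a single-inst unit
        return mops

    insts = [(1, set(), 1), (2, {1}, 1), (3, set(), 1), (4, {2, 3}, 1)]
    print(form_mops(insts))   # [(1, 2), (3, 4)]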

30
Macro-op scheduling (Kim Ph.D. thesis, 2004)
[Figure: MOP scheduling pipeline. MOP detection during fetch/decode/rename
produces MOP pointers plus dependence and wakeup-order information; MOP
formation at issue-queue insert builds coarser, MOP-grained entries for the
pipelined wakeup/select logic, while dispatch, payload RAM, RF, EXE, MEM, WB,
and commit (and the cache ports) remain instruction-grained, sequencing the
original instructions.]
31
MOP scheduling (2x) example
[Figure: a 16-instruction dependence graph scheduled two ways. With
conventional atomic select/wakeup, the schedule takes 9 cycles and occupies
16 queue entries; with 2x macro-ops and pipelined select/wakeup spread over
cycles n and n+1, it takes 10 cycles but only 9 queue entries.]
  • Pipelined instruction scheduling of multi-cycle
    MOPs
  • Still issues original instructions consecutively
  • Bigger instruction window
  • Multiple original instructions logically share a
    single issue queue entry

32
MOP scheduling performance (relaxed atomicity and
scalability constraints)
[Chart: performance with a 32-entry issue queue and 128-entry ROB]
  • Benefits from relaxing both the atomicity and
    scalability constraints
  • → Pipelined 2-cycle MOP scheduling performs
    comparably to, or even better than, atomic
    scheduling

33
Summary of instruction scheduling
  • Instruction scheduling: a set of wakeup and
    select operations
  • Scheduling atomicity should be maintained to
    ensure consecutive execution of dependent
    instructions
  • The wakeup-select loop is not easily pipelined
  • Speculative scheduling is essential for achieving
    scheduling atomicity when the scheduling loop
    stretches over multiple pipeline stages
  • Speculative wavefront propagation, scheduling
    replay schemes
  • TONS of complexity-reduction techniques exist for
    better instruction scheduling
  • Macro-ops are used in the Intel Core 2 Duo
    (cmp-jmp fusion only)
  • Extended to mini-graphs (Bracy, Roth)
  • Also attempts to pipeline scheduling:
    select-free, grandparent, etc.