Title: Advanced Register Dataflow
Slide 1: Advanced Register Dataflow
Prof. Mikko H. Lipasti, University of Wisconsin-Madison
Lecture notes based on notes by Ilhyun Kim; updated by Mikko Lipasti
Slide 2: Outline
- Instruction scheduling overview
- Scheduling atomicity
- Speculative scheduling
- Scheduling recovery
- Complexity-effective instruction scheduling techniques
Slide 3: Readings
- Read on your own:
  - Shen & Lipasti, Chapter 10 on Advanced Register Data Flow
  - Skim: I. Kim and M. Lipasti, "Understanding Scheduling Replay Schemes," in Proceedings of the 10th International Symposium on High-Performance Computer Architecture (HPCA-10), February 2004.
- To be discussed in class:
  - Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike Upton, "Continual Flow Pipelines," in Proceedings of ASPLOS 2004, October 2004.
  - T. Shaw, M. Martin, and A. Roth, "NoSQ: Store-Load Communication without a Store Queue," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.
  - Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric Rotenberg, and Haitham H. Akkary, "Transparent Control Independence," in Proceedings of ISCA-34, 2007.
Slide 4: We are talking about... (figure)
Slide 5: Instruction Scheduling
- The process of mapping a series of instructions onto execution resources
- Decides when and where an instruction is executed
Slide 6: Instruction Scheduling
- A set of wakeup and select operations
- Wakeup
  - Broadcasts the tags of selected parent instructions
  - Dependent instructions match the tags and determine whether their source operands are ready
  - Resolves true data dependences
- Select
  - Picks instructions to issue from a pool of ready instructions
  - Resolves resource conflicts
    - Issue bandwidth
    - Limited number of functional units / memory ports
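The wakeup/select loop above can be sketched in software. This is a toy timing model, not hardware: `select` here uses an oldest-first heuristic, and all names and structures are illustrative.

```python
# Toy model of the wakeup/select scheduling loop (illustrative only).
# Each instruction is a tag mapped to the set of source tags it still awaits.

def select(ready, width=2):
    """Select: pick up to `width` ready instructions (oldest-first)."""
    return sorted(ready)[:width]

def schedule(instrs, width=2):
    """instrs: dict tag -> set of pending source tags.
    Returns the issue order as a list of per-cycle groups."""
    pending = {t: set(srcs) for t, srcs in instrs.items()}
    order = []
    while pending:
        ready = [t for t, srcs in pending.items() if not srcs]
        if not ready:
            raise RuntimeError("deadlock: cyclic dependences")
        selected = select(ready, width)      # select: resolve resource conflicts
        order.append(selected)
        for t in selected:
            del pending[t]
        for srcs in pending.values():        # wakeup: broadcast selected tags
            srcs.difference_update(selected)
    return order

# i0 and i1 are independent; i2 depends on both; i3 depends on i2
deps = {0: set(), 1: set(), 2: {0, 1}, 3: {2}}
print(schedule(deps))  # [[0, 1], [2], [3]]
```

Note how a consumer (i3) issues the cycle after its producer (i2) — the back-to-back behavior that scheduling atomicity preserves.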
Slide 7: Scheduling Loop
- Basic wakeup and select operations
[Figure: the scheduling loop — ready instructions raise request lines (request 0..n) to the select logic; the select logic grants some of them (grant 0..n), the tag of each selected instruction is broadcast by the wakeup logic, and the selected instructions issue to the functional units]
Slide 8: Wakeup and Select (figure)
Slide 9: Scheduling Atomicity
- Operations in the scheduling loop must occur within a single clock cycle
  - Required for back-to-back execution of dependent instructions
Slide 10: Implications of Scheduling Atomicity
- Pipelining is a standard way to improve clock frequency
- Hard to pipeline instruction scheduling logic without losing ILP
  - ~10% IPC loss with 2-cycle scheduling
  - ~19% IPC loss with 3-cycle scheduling
- A major obstacle to building high-frequency microprocessors
Slide 11: Scheduler Designs
- Data-capture scheduler
  - Keeps the most recent register values in reservation stations
  - Data forwarding and wakeup are combined
Slide 12: Scheduler Designs
- Non-data-capture scheduler
  - Keeps the most recent register values in the register file (physical registers)
  - Data forwarding and wakeup are decoupled
  - Complexity benefits
    - Simpler scheduler / data / wakeup paths
Slide 13: Mapping to Pipeline Stages
[Figure: in a data-capture design the data and wakeup paths are combined within the scheduler; in the Pentium 4 (non-data-capture) the wakeup path is separated from the data path across pipeline stages]
Slide 14: Scheduling Atomicity & Non-Data-Capture Schedulers
- Multi-cycle scheduling loop
  - Scheduling atomicity is not maintained
    - Wakeup and select are separated by extra pipeline stages (Disp, RF)
  - Unable to issue dependent instructions consecutively
- → Solution: speculative scheduling
Slide 15: Speculative Scheduling
- Speculatively wake up dependent instructions even before the parent instruction starts execution
  - Keeps the scheduling loop within a single clock cycle
  - But nobody knows what will happen in the future
- Source of uncertainty in instruction scheduling: loads
  - Cache hit / miss
  - Store-to-load aliasing
  - → Eventually affects timing decisions
- The scheduler assumes that all types of instructions have pre-determined, fixed latencies
  - Load instructions are assumed to have the common-case (over 90% in general) DL1 hit latency
  - If incorrect, subsequent (dependent) instructions are replayed
Slide 16: Speculative Scheduling
- Unlike the original Tomasulo's algorithm
  - Instructions are scheduled BEFORE actual execution occurs
  - Assumes instructions have pre-determined, fixed latencies
    - ALU operations: fixed latency
    - Load operations: assume DL1 hit latency (the common case)
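The assumed-latency timing decision can be sketched as follows. This is a simplified model with unlimited issue width; the latency values and instruction kinds are illustrative, not any particular machine's.

```python
# Sketch: the scheduler assigns wakeup/issue times from ASSUMED latencies,
# not from actual execution outcomes (illustrative latency values).
ASSUMED_LATENCY = {"alu": 1, "load": 2}  # loads assumed to hit in the DL1

def speculative_issue_times(program):
    """program: list of (name, kind, sources) in dependence order.
    Returns name -> issue cycle under assumed latencies (unlimited width)."""
    kind_of = {name: kind for name, kind, _ in program}
    issue = {}
    for name, kind, srcs in program:
        # A consumer issues as soon as every producer's assumed latency elapses.
        issue[name] = max((issue[s] + ASSUMED_LATENCY[kind_of[s]]
                           for s in srcs), default=0)
    return issue

prog = [("ld", "load", []), ("add", "alu", ["ld"]), ("sub", "alu", ["add"])]
print(speculative_issue_times(prog))  # {'ld': 0, 'add': 2, 'sub': 3}
# If the load actually misses (say, 10+ cycles), 'add' was mis-scheduled
# at cycle 2 and must be replayed -- the topic of the next slides.
```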
Slide 17: Scheduling Replay
- Speculation needs verification / recovery
  - There's no free lunch
- If the actual load latency is longer than speculated (i.e., a cache miss)
  - Best solution (disregarding complexity): replay the data-dependent instructions issued under the load shadow
[Figure: cache miss detected after dependent instructions have already issued]
Slide 18: Wavefront Propagation
- Speculative execution wavefront
  - The speculative image of execution (from the scheduler's perspective)
- Both wavefronts propagate along dependence edges at the same rate (1 level / cycle)
  - The real wavefront runs behind the speculative wavefront
- The load resolution loop delay complicates the recovery process
  - A scheduling miss is signaled a couple of clock cycles after issue
Slide 19: Load Resolution Feedback Delay in Instruction Scheduling
[Figure: pipeline stages Broadcast/Wakeup + Select → Dispatch/Payload → RF → Execution → Misc.; the instruction scheduled at cycle N receives hit/miss feedback only after a multi-cycle delay, so the instructions at stages N-1 through N-4 along this path must be recovered]
- Scheduling runs multiple clock cycles ahead of execution
- But instructions can track only one level of dependence at a time (via their source operand identifiers)
Slide 20: Issues in Scheduling Replay
[Figure: across cycles n through n+3, the scheduler keeps issuing dependent instructions while the checker's cache-miss signal is still in flight from the execution stage]
- The speculative wavefront's propagation cannot be stopped in time
  - Both wavefronts propagate at the same rate
  - Dependent instructions are unnecessarily issued under load misses
Slide 21: Requirements of Scheduling Replay
- Propagation of recovery status should be faster than speculative wavefront propagation
- Recovery should be performed on the transitive closure of dependent instructions
- Conditions for ideal scheduling replay
  - All mis-scheduled dependent instructions are invalidated instantly
  - Independent instructions are unaffected
- Multiple levels of dependence tracking are needed
  - e.g., "Am I dependent on the current cache miss?"
  - Longer load resolution loop delay → more levels to track
Slide 22: Scheduling Replay Schemes
- Alpha 21264: non-selective replay
  - Replays all instructions, dependent and independent, issued under the load shadow
  - Analogous to squashing recovery for branch misprediction
  - Simple, but high performance penalty
    - Independent instructions are unnecessarily replayed
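Non-selective replay can be sketched as a pure squash-by-time policy: on a miss, everything issued within the load's shadow window is replayed, whether or not it depends on the load. A toy model (the log format and shadow length are illustrative):

```python
# Sketch of Alpha 21264-style non-selective replay: on a load miss,
# squash everything issued in the load shadow, dependent or not.
def nonselective_replay(issued_log, miss_cycle, shadow):
    """issued_log: list of (cycle, tag) in issue order.
    Returns tags to replay: all instructions issued in
    (miss_cycle, miss_cycle + shadow]."""
    return [tag for cycle, tag in issued_log
            if miss_cycle < cycle <= miss_cycle + shadow]

log = [(0, "ld"), (1, "add"), (1, "xor"), (2, "sub"), (5, "mul")]
# The load issues at cycle 0 and its miss resolves 3 cycles later.
print(nonselective_replay(log, 0, 3))  # ['add', 'xor', 'sub']
# 'xor' is independent of the load but is replayed anyway --
# exactly the performance penalty this slide describes.
```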
Slide 23: Position-Based Selective Replay
- Ideal selective recovery
  - Replay dependent instructions only
- Dependence tracking is managed in matrix form
  - Columns: load issue slots; rows: pipeline stages
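The matrix idea can be sketched loosely as follows: each in-flight instruction carries a small bit matrix (column = load issue slot, row = pipeline stage depth), consumers inherit their parents' bits shifted one row deeper each wakeup, and a miss in slot c kills exactly the instructions whose matrix has a bit set in column c. This is a rough model of the technique, not any real design; all sizes and names are illustrative.

```python
# Sketch of position-based dependence tracking (illustrative sizes).
# Column = load issue slot, row = pipeline stage depth below the load.
NUM_SLOTS, NUM_STAGES = 2, 4

def new_matrix():
    return [[0] * NUM_SLOTS for _ in range(NUM_STAGES)]

def on_load_issue(slot):
    """A load seeds its own matrix: the stage-0 bit in its issue slot."""
    m = new_matrix()
    m[0][slot] = 1
    return m

def on_wakeup(parent_matrices):
    """A consumer inherits its parents' bits shifted one stage deeper,
    mirroring the 1-level-per-cycle wavefront propagation."""
    m = new_matrix()
    for pm in parent_matrices:
        for row in range(NUM_STAGES - 1):
            for col in range(NUM_SLOTS):
                m[row + 1][col] |= pm[row][col]
    return m

def must_replay(m, miss_slot, resolution_stage):
    """On a miss in `miss_slot`, replay any instruction whose matrix shows
    a dependence on that slot within the resolution window."""
    return any(m[row][miss_slot] for row in range(resolution_stage + 1))

ld = on_load_issue(slot=0)     # load issued from slot 0
add = on_wakeup([ld])          # child of the load
sub = on_wakeup([add])         # grandchild of the load
other = on_wakeup([])          # independent instruction
assert must_replay(add, 0, 3) and must_replay(sub, 0, 3)
assert not must_replay(other, 0, 3)
print("selective replay: only true dependents squashed")
```

Unlike non-selective replay, the independent instruction survives — only the transitive closure of the load's dependents is invalidated.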
Slide 24: Low-Complexity Scheduling Techniques
- FIFO (Palacharla, Jouppi, and Smith, 1996)
  - Replaces conventional scheduling logic with multiple FIFOs
  - Steering logic places instructions into different FIFOs based on their dependences
    - Each FIFO contains a chain of dependent instructions
  - Only the head instructions are considered for issue
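The steering heuristic can be sketched as: append an instruction to a FIFO whose tail produces one of its sources (extending a dependence chain), otherwise start a new chain in an empty FIFO. This is a simplified version of the Palacharla/Jouppi/Smith idea; the tie-breaking rules here are illustrative.

```python
# Sketch of dependence-based FIFO steering (simplified heuristic).
def steer(program, num_fifos=4):
    """program: list of (name, sources) in program order.
    Returns the FIFO contents after steering all instructions."""
    fifos = [[] for _ in range(num_fifos)]
    for name, srcs in program:
        target = None
        for f in fifos:
            if f and f[-1] in srcs:            # extend a dependence chain
                target = f
                break
        if target is None:                      # else start a new chain
            target = next((f for f in fifos if not f), fifos[0])
        target.append(name)
    return fifos

prog = [("a", []), ("b", ["a"]), ("c", []), ("d", ["c"]), ("e", ["b"])]
print(steer(prog))  # [['a', 'b', 'e'], ['c', 'd'], [], []]
```

Because each FIFO holds one dependence chain, examining only the heads (here 'a' and 'c' initially) finds most of the ready instructions without a full associative wakeup across the window.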
Slide 25: FIFO (cont'd) (figure)
Slide 26: FIFO (cont'd)
- Performance
  - Comparable to conventional scheduling
  - Reduced scheduling logic complexity
- Many related papers on clustered microarchitectures
Slide 27: Sequential Wakeup (Kim and Lipasti, 2003)
- Not all instructions have 2 pending source operands
  - Only 4-16% of instructions have both source operands awakened before being issued
  - Such instructions usually have slack between the two wakeups
- Reduce the load capacitance (and potentially the length) of the wakeup bus by decoupling half of the tag comparators
[Figure: conventional vs. sequential-wakeup tag comparator organization]
Slide 28: Sequential Wakeup (cont'd)
- Scheduling example (figure)
- Performance of sequential wakeup
  - Average 0.4%, worst case 1% IPC degradation
  - Up to 25% frequency benefit in the scheduling logic
Slide 29: Overcoming Scheduling Atomicity
- Source of scheduling atomicity: the minimal execution latency of an instruction
  - Many ALU operations have single-cycle latency
  - Scheduling must keep up with execution
    - 1-cycle instructions need 1-cycle scheduling
  - Multi-cycle operations do not need atomic scheduling
- → Relax the constraint by increasing the size of the scheduling unit
  - Combine multiple instructions into one multi-cycle-latency unit
  - Pipelined scheduling logic can still issue the original instructions in consecutive clock cycles
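The grouping step can be sketched as a greedy pairing pass: fuse a single-cycle instruction with one of its dependents so the pair forms a 2-cycle scheduling unit that a 2-deep pipelined scheduler can handle. The pairing heuristic below is illustrative, not the thesis's actual MOP-detection algorithm.

```python
# Sketch of macro-op (MOP) formation: fuse an instruction with one
# dependent into a 2-cycle unit (greedy pairing, illustrative heuristic).
def form_mops(program):
    """program: list of (name, sources) in program order.
    Greedily pair each unfused instruction with its first unfused consumer."""
    fused = set()
    mops = []
    for i, (name, _) in enumerate(program):
        if name in fused:
            continue
        partner = next((n for n, srcs in program[i + 1:]
                        if name in srcs and n not in fused), None)
        if partner:
            fused.update((name, partner))
            mops.append((name, partner))   # 2-cycle macro-op
        else:
            mops.append((name,))           # singleton instruction
    return mops

prog = [("a", []), ("b", ["a"]), ("c", []), ("d", ["c", "b"])]
print(form_mops(prog))  # [('a', 'b'), ('c', 'd')]
```

Each 2-instruction MOP occupies one issue queue entry and has a 2-cycle latency from the scheduler's perspective, so a pipelined 2-cycle scheduler sees no atomicity violation, yet the two fused instructions still execute in consecutive cycles.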
Slide 30: Macro-Op Scheduling (Kim, Ph.D. thesis, 2004)
[Figure: pipeline Fetch / Decode / Rename / Queue / Scheduling / Disp / RF / EXE / MEM / WB / Commit; MOP detection and MOP formation occur around rename and issue-queue insert, so wakeup/select operate at the coarser MOP grain while fetch and the execution back end remain instruction-grained; MOP pointers carry dependence and wakeup-order information, and instruction sequencing expands MOPs back into original instructions]
Slide 31: MOP Scheduling (2x) Example
[Figure: a dependence graph of 16 instructions scheduled two ways — instruction-grained atomic select/wakeup takes 9 cycles and occupies 16 queue entries, while pairing instructions into 2-instruction macro-ops and scheduling with a pipelined 2-cycle select/wakeup takes 10 cycles and only 9 queue entries]
- Pipelined instruction scheduling of multi-cycle MOPs
  - Still issues the original instructions consecutively
- Bigger instruction window
  - Multiple original instructions logically share a single issue queue entry
Slide 32: MOP Scheduling Performance (Relaxed Atomicity & Scalability Constraints)
[Figure: performance results for a 32-entry IQ / 128-entry ROB configuration]
- Benefits from both relaxed atomicity and relaxed scalability constraints
- → Pipelined 2-cycle MOP scheduling performs comparably to, or even better than, atomic scheduling
Slide 33: Summary of Instruction Scheduling
- Instruction scheduling: a set of wakeup and select operations
- Scheduling atomicity must be maintained to ensure consecutive execution of dependent instructions
  - Wakeup and select are not easily pipelined
- Speculative scheduling is essential for maintaining scheduling atomicity when the scheduling loop stretches over multiple pipeline stages
  - Speculative wavefront propagation, scheduling replay schemes
- TONS of complexity-reduction techniques for better instruction scheduling
  - Macro-ops used in Intel Core 2 Duo (cmp-jmp fusion only)
    - Extended to mini-graphs (Bracy and Roth)
  - Also attempts to pipeline scheduling: select-free, grandparent, etc.