1
Advanced Register Dataflow
Prof. Mikko H. Lipasti, University of Wisconsin-Madison
Lecture notes based on notes by Ilhyun Kim
Updated by Mikko Lipasti
2
Outline
  • Instruction scheduling overview
  • Scheduling atomicity
  • Speculative scheduling
  • Scheduling recovery
  • Complexity-effective instruction scheduling
    techniques

3
Readings
  • Read on your own:
  • Shen & Lipasti, Chapter 10 on Advanced Register
    Data Flow (skim)
  • I. Kim and M. Lipasti, "Understanding Scheduling
    Replay Schemes," in Proceedings of the 10th
    International Symposium on High-Performance
    Computer Architecture (HPCA-10), February 2004.
  • To be discussed in class:
  • Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary,
    Amit Gandhi, and Mike Upton, "Continual Flow
    Pipelines," in Proceedings of ASPLOS 2004,
    October 2004.
  • T. Sha, M. Martin, A. Roth, "NoSQ: Store-Load
    Communication without a Store Queue," in
    Proceedings of the 39th Annual IEEE/ACM
    International Symposium on Microarchitecture,
    2006.
  • Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric
    Rotenberg, Haitham H. Akkary, "Transparent
    Control Independence," in Proceedings of ISCA-34,
    2007.

4
We are talking about
  • Register data flow

5
Instruction scheduling
  • A process of mapping a series of instructions
    onto execution resources
  • Decides when and where an instruction is executed
  • Example (figure): a data dependence graph mapped
    to two FUs

6
Instruction scheduling
  • A set of wakeup and select operations (sketched
    in code below)
  • Wakeup
  • Broadcasts the tags of the selected parent
    instructions
  • A dependent instruction compares the broadcast
    tags against its source tags to determine whether
    its source operands are ready
  • Resolves true data dependences
  • Select
  • Picks instructions to issue from the pool of
    ready instructions
  • Resolves resource conflicts
  • Issue bandwidth is limited by the number of
    functional units / memory ports
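
As a rough illustration of this loop (a minimal software sketch with
hypothetical structures, not taken from the slides), one scheduling cycle can
be modeled as a select over ready entries followed by a tag broadcast that
wakes their dependents:

    # Minimal model of one atomic wakeup/select cycle (hypothetical structures).
    from dataclasses import dataclass

    @dataclass
    class Entry:
        dest_tag: int        # tag broadcast when this instruction issues
        src_tags: set        # tags of source operands still outstanding
        ready: bool = False

    def wakeup(queue, broadcast_tags):
        # Dependents drop matching tags; with no tags left, the entry is ready.
        for e in queue:
            e.src_tags -= broadcast_tags
            if not e.src_tags:
                e.ready = True

    def select(queue, issue_width):
        # Pick up to issue_width ready entries (resolves resource conflicts).
        return [e for e in queue if e.ready][:issue_width]

    def schedule_cycle(queue, issue_width):
        issued = select(queue, issue_width)
        for e in issued:
            queue.remove(e)
        # Atomicity: the selected tags wake dependents within the same cycle.
        wakeup(queue, {e.dest_tag for e in issued})
        return issued

    q = [Entry(dest_tag=1, src_tags=set()),     # no outstanding sources
         Entry(dest_tag=2, src_tags={1})]       # waits for tag 1
    wakeup(q, set())                            # mark initially-ready entries
    print([e.dest_tag for e in schedule_cycle(q, issue_width=2)])   # [1]
    print([e.dest_tag for e in schedule_cycle(q, issue_width=2)])   # [2]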

7
Scheduling loop
  • Basic wakeup and select operations

[Figure: the scheduling loop. Wakeup logic broadcasts the tag of the selected
instruction; entries whose operands become ready raise request lines to the
select logic, which returns grant signals and issues the selected instructions
to the FUs.]
8
Wakeup and Select
9
Scheduling Atomicity
  • Operations in the scheduling loop must occur
    within a single clock cycle
  • For back-to-back execution of dependent
    instructions

10
Implication of scheduling atomicity
  • Pipelining is a standard way to improve clock
    frequency
  • Hard to pipeline instruction scheduling logic
    without losing ILP
  • 10% IPC loss with 2-cycle scheduling
  • 19% IPC loss with 3-cycle scheduling
  • A major obstacle to building high-frequency
    microprocessors
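
The IPC cost has a simple intuition (a back-of-the-envelope sketch of my own,
not the slides' simulation data): if the wakeup/select loop takes n cycles, a
chain of dependent single-cycle instructions can issue at best every n cycles,
so the chain stretches out roughly by a factor of n:

    # Earliest issue cycle of each instruction in a chain of dependent 1-cycle
    # ops, when the only limit is an n-cycle scheduling (wakeup/select) loop.
    def chain_issue_cycles(chain_length, sched_loop_cycles):
        return [i * sched_loop_cycles for i in range(chain_length)]

    print(chain_issue_cycles(5, 1))  # [0, 1, 2, 3, 4]   back-to-back issue
    print(chain_issue_cycles(5, 2))  # [0, 2, 4, 6, 8]   one bubble per dependence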

11
Scheduler Designs
  • Data-Capture Scheduler
  • keep the most recent register value in
    reservation stations
  • Data forwarding and wakeup are combined

12
Scheduler Designs
  • Non-Data-Capture Scheduler
  • keep the most recent register value in RF
    (physical registers)
  • Data forwarding and wakeup are decoupled
  • Complexity benefits
  • simpler scheduler / data / wakeup path

13
Mapping to pipeline stages
  • AMD K7 (data-capture): data forwarding and
    wakeup are combined in the same pipeline stage
  • Pentium 4 (non-data-capture): wakeup and data
    delivery occur in separate pipeline stages
14
Scheduling atomicity and the non-data-capture scheduler
  • Multi-cycle scheduling loop
  • Scheduling atomicity is not maintained
  • Wakeup and execution are separated by extra
    pipeline stages (Disp, RF)
  • Unable to issue dependent instructions
    consecutively
  • → Solution: speculative scheduling

15
Speculative Scheduling
  • Speculatively wake up dependent instructions even
    before the parent instruction starts execution
  • This keeps the scheduling loop within a single
    clock cycle
  • But nobody knows what will happen in the future
  • Source of uncertainty in instruction scheduling:
    loads
  • Cache hit / miss
  • Store-to-load aliasing
  • → eventually affects timing decisions
  • Scheduler assumes that all types of instructions
    have pre-determined, fixed latencies
  • Load instructions are assumed to have the common-
    case (over 90% in general) DL1 hit latency
  • If incorrect, subsequent (dependent) instructions
    are replayed
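
A small sketch of the timing involved (my own simplified model with made-up
names and latencies): dependents are woken at the assumed DL1 hit latency, and
the checker only learns the real outcome a few cycles later, which determines
whether they must replay:

    # Speculative scheduling timing sketch (assumed, illustrative latencies).
    ASSUMED_HIT_LATENCY = 3   # dependents are woken this many cycles after issue
    MISS_RESOLVE_DELAY  = 6   # actual hit/miss outcome is known only this late

    def schedule_load(issue_cycle, actually_hits):
        wakeup_cycle = issue_cycle + ASSUMED_HIT_LATENCY   # speculative wakeup
        verify_cycle = issue_cycle + MISS_RESOLVE_DELAY    # checker result arrives
        replay_needed = not actually_hits                  # dependents issued in
        return wakeup_cycle, verify_cycle, replay_needed   # the shadow must replay

    print(schedule_load(10, actually_hits=True))   # (13, 16, False)
    print(schedule_load(10, actually_hits=False))  # (13, 16, True) -> replay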

16
Speculative Scheduling
  • Overview
  • Unlike the original Tomasulo's algorithm,
    instructions are scheduled BEFORE actual
    execution occurs
  • Assumes instructions have pre-determined, fixed
    latencies
  • ALU operations: fixed latency
  • Load operations: assume the DL1 hit latency
    (common case)

17
Scheduling replay
  • Speculation needs verification / recovery
  • There's no free lunch
  • If the actual load latency is longer than what
    was speculated (i.e., a cache miss)
  • Best solution (disregarding complexity): replay
    data-dependent instructions issued under the load
    shadow

[Figure: the cache miss is detected only after dependent instructions have
already issued under the load shadow]
18
Wavefront propagation
  • Speculative execution wavefront
  • The speculative image of execution (from the
    scheduler's perspective)
  • Both wavefronts propagate along dependence edges
    at the same rate (1 level / cycle)
  • The real wavefront runs behind the speculative
    wavefront
  • The load resolution loop delay complicates the
    recovery process
  • A scheduling miss is signaled a couple of clock
    cycles after issue

19
Load resolution feedback delay in instruction
scheduling
[Figure: the load resolution loop. Broadcast/wakeup and select for cycle N run
several stages ahead of dispatch/payload, RF read, execution, and the miss
check (spread over cycles N-1 through N-4); the time delay between scheduling
and feedback means all instructions issued along this path must be recovered.]
  • Scheduling runs multiple clock cycles ahead of
    execution
  • But, instructions can keep track of only one
    level of dependence at a time (using source
    operand identifiers)

20
Issues in scheduling replay
[Figure: timeline from cycle n to n+3. Instructions are scheduled/issued and
executed; the checker raises the cache miss signal only several cycles after
the load's dependents have been scheduled.]
  • Cannot stop speculative wavefront propagation
  • Both wavefronts propagate at the same rate
  • Dependent instructions are unnecessarily issued
    under load misses

21
Requirements of scheduling replay
  • Propagation of recovery status should be faster
    than speculative wavefront propagation
  • Recovery should be performed on the transitive
    closure of dependent instructions
  • Conditions for ideal scheduling replay
  • All mis-scheduled dependent instructions are
    invalidated instantly
  • Independent instructions are unaffected
  • Multiple levels of dependence tracking are needed
  • e.g., "Am I dependent on the current cache miss?"
  • Longer load resolution loop delay → tracking more
    levels
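
A software illustration of the transitive-closure requirement (my own sketch,
not a description of the hardware): starting from the missed load, every
reachable consumer is collected into the replay set, and anything unreachable
is left alone:

    # Transitive closure of instructions dependent on a missed load.
    # consumers maps an instruction id to the ids that read its result.
    def replay_set(missed_load, consumers):
        to_replay, worklist = set(), [missed_load]
        while worklist:
            inst = worklist.pop()
            for child in consumers.get(inst, ()):
                if child not in to_replay:       # visit each dependent once
                    to_replay.add(child)
                    worklist.append(child)
        return to_replay                         # independent insts untouched

    deps = {0: [1, 2], 1: [3], 4: [5]}           # 4 and 5 are independent of load 0
    print(sorted(replay_set(0, deps)))           # [1, 2, 3]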

22
Scheduling replay schemes
  • Alpha 21264: non-selective replay
  • Replays all dependent and independent
    instructions issued under the load shadow
  • Analogous to squashing recovery for branch
    misprediction
  • Simple, but high performance penalty
  • Independent instructions are unnecessarily
    replayed

23
Position-based selective replay
  • Ideal selective recovery
  • Replay dependent instructions only
  • Dependence tracking is managed in a matrix form
  • Column: load issue slot; row: pipeline stage
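
A rough software analogue of this idea (my own simplification that drops the
pipeline-stage rows and keeps only the load-slot columns): each instruction
carries a bit per load issue slot, ORs in its parents' bits as it wakes up,
and replays if the bit for the missing load's slot is set:

    # Simplified position-based dependence tracking: one bit per load issue slot.
    NUM_LOAD_SLOTS = 4

    def new_vector():
        return [0] * NUM_LOAD_SLOTS

    def wake_from_load(vec, load_slot):
        vec[load_slot] = 1                        # direct dependence on that load
        return vec

    def wake_from_parent(child_vec, parent_vec):
        # Inherit the parent's load dependences (transitive tracking).
        return [c | p for c, p in zip(child_vec, parent_vec)]

    def must_replay(vec, missed_slot):
        return vec[missed_slot] == 1

    v_child      = wake_from_load(new_vector(), load_slot=2)
    v_grandchild = wake_from_parent(new_vector(), v_child)
    print(must_replay(v_grandchild, 2))           # True  -> replay
    print(must_replay(v_grandchild, 0))           # False -> unaffected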

24
Low-complexity scheduling techniques
  • FIFO (Palacharla, Jouppi, Smith, 1996)
  • Replaces conventional scheduling logic with
    multiple FIFOs
  • Steering logic places instructions into different
    FIFOs based on their dependences
  • Each FIFO holds a chain of dependent instructions
  • Only the head instructions are considered for
    issue
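
A minimal sketch of the steering idea (a hypothetical heuristic, simpler than
the paper's): append an instruction to a FIFO whose tail produces one of its
sources, otherwise start an empty FIFO, and let only the FIFO heads compete
for issue:

    # Dependence-based FIFO steering sketch (simplified heuristic).
    def steer(fifos, inst_id, src_ids):
        # Prefer a FIFO whose tail produces one of this instruction's sources.
        for f in fifos:
            if f and f[-1] in src_ids:
                f.append(inst_id)
                return
        # Otherwise use an empty FIFO, falling back to the first one.
        for f in fifos:
            if not f:
                f.append(inst_id)
                return
        fifos[0].append(inst_id)

    def issue_heads(fifos, is_ready):
        # Only the head instruction of each FIFO is a candidate for issue.
        issued = []
        for f in fifos:
            if f and is_ready(f[0]):
                issued.append(f.pop(0))
        return issued

    fifos = [[], [], []]
    steer(fifos, 1, set())       # independent
    steer(fifos, 2, {1})         # depends on 1 -> joins 1's FIFO
    steer(fifos, 3, set())       # independent -> new FIFO
    print(fifos)                                          # [[1, 2], [3], []]
    print(issue_heads(fifos, is_ready=lambda i: True))    # [1, 3]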

25
FIFO (cont'd)
  • Scheduling example

26
FIFO (cont'd)
  • Performance
  • Comparable performance to conventional scheduling
  • Reduced scheduling logic complexity
  • Many related papers on clustered
    microarchitectures

27
Sequential wakeup (Kim, Lipasti, 2003)
  • Not all instructions have two pending source
    operands
  • Only 4-16% of instructions have two source
    operands awakened before being issued
  • Such instructions usually have slack between the
    two wakeups
  • Reduce the load capacitance (and potentially the
    length) of the wakeup bus by decoupling half of
    the tag comparators

[Figure: conventional wakeup vs. sequential wakeup comparator organization]
28
Sequential wakeup (cont'd)
  • Scheduling example
  • Performance of sequential wakeup
  • Average 0.4%, worst case 1% IPC degradation
  • Up to 25% frequency benefit in the scheduling
    logic

29
Overcoming scheduling atomicity
  • Source of scheduling atomicity: the minimal
    execution latency of an instruction
  • Many ALU operations have single-cycle latency
  • Scheduling should keep up with execution
  • 1-cycle instructions need 1-cycle scheduling
  • Multi-cycle operations do not need atomic
    scheduling
  • → Relax the constraint by increasing the size of
    the scheduling unit
  • Combine multiple instructions into a multi-cycle-
    latency unit
  • Pipelined scheduling logic can then issue the
    original instructions in consecutive clock cycles
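
As an illustration of the grouping step (my own sketch, not the thesis's
actual MOP detection logic), pairs of dependent single-cycle instructions can
be fused into one two-cycle scheduling unit:

    # Sketch: pair each single-cycle instruction with a dependent single-cycle
    # consumer to form a 2-cycle macro-op (MOP) scheduling unit.
    def form_mops(insts):
        """insts: list of (inst_id, src_ids, latency). Returns MOP tuples."""
        grouped, mops = set(), []
        for head_id, _, head_lat in insts:
            if head_id in grouped or head_lat != 1:
                continue
            tail = next((i for i in insts
                         if i[0] not in grouped and i[0] != head_id
                         and head_id in i[1] and i[2] == 1), None)
            if tail:
                grouped.update({head_id, tail[0]})
                mops.append((head_id, tail[0]))   # scheduled as one 2-cycle unit
            else:
                grouped.add(head_id)
                mops.append((head_id,))           # left as a single-inst unit
        return mops

    insts = [(1, set(), 1), (2, {1}, 1), (3, set(), 1), (4, {2, 3}, 1)]
    print(form_mops(insts))   # [(1, 2), (3, 4)]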

30
Macro-op scheduling (Kim Ph.D. thesis, 2004)
[Figure: MOP scheduling pipeline. MOP detection during fetch/decode/rename
produces MOP pointers plus dependence and wakeup-order information; MOP
formation at issue-queue insert builds coarser, MOP-grained entries for the
pipelined wakeup/select logic, while dispatch, payload RAM, RF, EXE, MEM, WB,
and commit (and the cache ports) remain instruction-grained, sequencing the
original instructions.]
31
MOP scheduling (2x) example
[Figure: a 16-instruction dependence graph scheduled two ways. With
conventional atomic select/wakeup, the schedule takes 9 cycles and occupies
16 queue entries; with 2x macro-ops and pipelined select/wakeup spread over
cycles n and n+1, it takes 10 cycles but only 9 queue entries.]
  • Pipelined instruction scheduling of multi-cycle
    MOPs
  • Still issues original instructions consecutively
  • Bigger instruction window
  • Multiple original instructions logically share a
    single issue queue entry

32
MOP scheduling performance (relaxed atomicity and
scalability constraints)
[Chart: performance with a 32-entry issue queue and 128-entry ROB]
  • Benefits from relaxing both the atomicity and
    scalability constraints
  • → Pipelined 2-cycle MOP scheduling performs
    comparably to, or even better than, atomic
    scheduling

33
Summary of instruction scheduling
  • Instruction scheduling: a set of wakeup and
    select operations
  • Scheduling atomicity should be maintained to
    ensure consecutive execution of dependent
    instructions
  • The wakeup-select loop is not easily pipelined
  • Speculative scheduling is essential for achieving
    scheduling atomicity when the scheduling loop
    stretches over multiple pipeline stages
  • Speculative wavefront propagation, scheduling
    replay schemes
  • TONS of complexity-reduction techniques exist for
    better instruction scheduling
  • Macro-ops are used in the Intel Core 2 Duo
    (cmp-jmp fusion only)
  • Extended to mini-graphs (Bracy, Roth)
  • Also attempts to pipeline scheduling:
    select-free, grandparent, etc.