Advanced Microarchitecture - PowerPoint PPT Presentation

About This Presentation

Title:

Advanced Microarchitecture

Description:

Advanced Microarchitecture Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by Ilhyun Kim Updated by Mikko Lipasti – PowerPoint PPT presentation

Slides: 51

Provided by: iki9

Learn more at: https://ece752.ece.wisc.edu

Transcript and Presenter's Notes

Title: Advanced Microarchitecture

1
Advanced Microarchitecture
Prof. Mikko H. Lipasti, University of Wisconsin-Madison
Lecture notes based on notes by Ilhyun Kim
Updated by Mikko Lipasti
2
Outline

Instruction scheduling overview
Scheduling atomicity
Speculative scheduling
Scheduling recovery
Complexity-effective instruction scheduling
techniques
CRIB reading
Scalable load/store handling
NoSQ reading
Building large instruction windows
Runahead, CFP, iCFP
Control Independence
3D die stacking

3
Readings

Read on your own
Shen & Lipasti, Chapter 10 on Advanced Register Data Flow (skim)
I. Kim and M. Lipasti, "Understanding Scheduling Replay Schemes," in Proceedings of the 10th International Symposium on High-Performance Computer Architecture (HPCA-10), February 2004.
Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike Upton, "Continual Flow Pipelines," in Proceedings of ASPLOS 2004, October 2004.
Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric Rotenberg, and Haitham H. Akkary, "Transparent Control Independence," in Proceedings of ISCA-34, 2007.
To be discussed in class
T. Sha, M. Martin, and A. Roth, "NoSQ: Store-Load Communication without a Store Queue," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.
Erika Gunadi and Mikko Lipasti, "CRIB: Combined Rename, Issue, and Bypass," ISCA 2011.
Andrew Hilton and Amir Roth, "BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Execution," in Proceedings of HPCA 2010.
G. H. Loh, Y. Xie, and B. Black, "Processor Design in 3D Die-Stacking Technologies," IEEE Micro 27, 3 (May 2007), 31-48.

4
Register Dataflow
5
Instruction scheduling

A process of mapping a series of instructions
into execution resources
Decides when and where an instruction is executed

[Figure: data dependence graph mapped to two FUs]

6
Instruction scheduling

A set of wakeup and select operations
Wakeup
Broadcasts the tags of the selected parent instructions
Dependent instructions match the broadcast tags to determine whether their source operands are ready
Resolves true data dependences
Select
Picks instructions to issue from the pool of ready instructions
Resolves resource conflicts
Issue bandwidth
Limited number of functional units / memory ports
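
The wakeup and select steps above can be sketched as a small cycle-by-cycle simulation. This is only an illustrative model, not the logic of any real scheduler: the function name `schedule`, the oldest-first select policy, the unit execution latency, and the example dependence graph are all assumptions made for the sketch.

```python
def schedule(deps, num_fus=2):
    """Simulate a single-cycle wakeup/select loop with unit-latency instructions.

    deps maps an instruction tag to the tags of its parents (assumed to form a
    DAG whose parents all appear as keys). Returns {tag: issue cycle}.
    """
    waiting = {tag: set(parents) for tag, parents in deps.items()}
    issue_cycle = {}
    cycle = 0
    while waiting:
        # Select: grant up to num_fus ready instructions, oldest (lowest tag) first.
        ready = [tag for tag in sorted(waiting) if not waiting[tag]]
        granted = ready[:num_fus]
        if not granted:
            raise ValueError("no ready instruction: dependence cycle or unknown parent")
        for tag in granted:
            issue_cycle[tag] = cycle
            del waiting[tag]
        # Wakeup: broadcast the granted tags; dependents clear matching sources.
        for sources in waiting.values():
            sources.difference_update(granted)
        cycle += 1
    return issue_cycle


# Hypothetical six-instruction dependence graph mapped onto two FUs.
deps = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: [], 5: [4]}
print(schedule(deps))
```

In this example instruction 5 is ready in cycle 1 but loses select to the older instructions 1 and 2, so it issues in cycle 2: wakeup resolved its data dependence, and select resolved the resource conflict.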

7
Scheduling loop

Basic wakeup and select operations

[Figure: scheduling loop. Wakeup logic turns ready instructions into requests (request 0 to request n); select logic returns grants (grant 0 to grant n); the selected instruction issues to an FU and its tag is broadcast back to the wakeup logic.]
8
Wakeup and Select
9
Scheduling Atomicity

Operations in the scheduling loop must occur
within a single clock cycle
For back-to-back execution of dependent
instructions

10
Implication of scheduling atomicity

Pipelining is a standard way to improve clock
frequency
Hard to pipeline instruction scheduling logic
without losing ILP
10% IPC loss in 2-cycle scheduling
19% IPC loss in 3-cycle scheduling
A major obstacle to building high-frequency
microprocessors
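
A toy model shows why a pipelined scheduling loop hurts dependent chains but not independent work. It is a sketch under simplifying assumptions (unlimited issue width, unit execution latency, a hypothetical `issue_times` helper) and does not attempt to reproduce the IPC losses quoted above.

```python
def issue_times(deps, loop_cycles=1, exec_latency=1):
    """Earliest issue cycle per instruction, assuming unlimited issue width.

    deps maps each instruction to its parents and lists parents before children.
    A dependent can issue no sooner than max(loop_cycles, exec_latency) cycles
    after its parent, because the parent's tag must travel around the loop.
    """
    gap = max(loop_cycles, exec_latency)
    times = {}
    for inst, parents in deps.items():
        times[inst] = max((times[p] + gap for p in parents), default=0)
    return times


chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
independent = {"a": [], "b": [], "c": [], "d": []}
for loop_cycles in (1, 2, 3):
    print(loop_cycles,
          issue_times(chain, loop_cycles)["d"],
          issue_times(independent, loop_cycles)["d"])
```

With a 1-cycle loop the last instruction of the chain issues in cycle 3; with 2- and 3-cycle loops it slips to cycles 6 and 9, while the independent instructions are unaffected.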

11
Scheduler Designs

Data-Capture Scheduler
keep the most recent register value in
reservation stations
Data forwarding and wakeup are combined

12
Scheduler Designs

Non-Data-Capture Scheduler
keep the most recent register value in RF
(physical registers)
Data forwarding and wakeup are decoupled

Complexity benefits
simpler scheduler / data / wakeup path

13
Mapping to pipeline stages

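AMD K7 (data-capture)
[Figure: K7 pipeline stages, with a combined Data / wakeup path]

Pentium 4 (non-data-capture)
[Figure: Pentium 4 pipeline stages, with separate wakeup and Data paths]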
14
Scheduling atomicity & non-data-capture scheduler

Multi-cycle scheduling loop
Scheduling atomicity is not maintained
Separated by extra pipeline stages (Disp, RF)
Unable to issue dependent instructions consecutively
→ Solution: speculative scheduling

15
Speculative Scheduling

Speculatively wake up dependent instructions even before the parent instruction starts execution
Keep the scheduling loop within a single clock cycle
But nobody knows what will happen in the future
Source of uncertainty in instruction scheduling: loads
Cache hit / miss
Store-to-load aliasing
→ eventually affects timing decisions
Scheduler assumes that all types of instructions have pre-determined fixed latencies
Load instructions are assumed to have the common-case (over 90% in general) DL1 hit latency
If incorrect, subsequent (dependent) instructions are replayed

16
Speculative Scheduling

Overview

Unlike the original Tomasulo's algorithm
Instructions are scheduled BEFORE actual execution occurs
Assumes instructions have pre-determined fixed latencies
ALU operations: fixed latency
Load operations: assume DL1 hit latency (common case)

17
Scheduling replay

Speculation needs verification / recovery
There's no free lunch
If the actual load latency is longer than what was speculated (i.e. cache miss)
Best solution (disregarding complexity): replay data-dependent instructions issued under load shadow

[Figure: cache miss detected]
18
Wavefront propagation

Speculative execution wavefront
speculative image of execution (from the scheduler's perspective)
Both wavefronts propagate along dependence edges at the same rate (1 level / cycle)
the real wavefront runs behind the speculative wavefront
The load resolution loop delay complicates the recovery process
a scheduling miss is signaled a couple of clock cycles after issue

19
Load resolution feedback delay in instruction
scheduling
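[Figure: load resolution loop. Stages shown: Broadcast / wakeup, Select, Dispatch / Payload, RF, Misc., Execution, with cycle labels N-4 through N. The time delay between scheduling and feedback determines which instructions in this path must be recovered.]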

Scheduling runs multiple clock cycles ahead of
execution
But, instructions can keep track of only one
level of dependence at a time (using source
operand identifiers)

20
Issues in scheduling replay
[Figure: timeline over cycle n to cycle n+3 showing the Sched / Issue, Exe, and checker stages, with the cache miss signal fed back from the checker]

Cannot stop speculative wavefront propagation
Both wavefronts propagate at the same rate
Dependent instructions are unnecessarily issued
under load misses

21
Requirements of scheduling replay

Propagation of recovery status should be faster
than speculative wavefront propagation
Recovery should be performed on the transitive
closure of dependent instructions

Conditions for ideal scheduling replay
All mis-scheduled dependent instructions are
invalidated instantly
Independent instructions are unaffected
Multiple levels of dependence tracking are needed
e.g. "Am I dependent on the current cache miss?"
Longer load resolution loop delay → tracking more levels

[Figure: load miss]
22
Scheduling replay schemes

Alpha 21264: non-selective replay
Replays all dependent and independent
instructions issued under load shadow
Analogous to squashing recovery in branch
misprediction
Simple but high performance penalty
Independent instructions are unnecessarily
replayed
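
The difference between replaying the transitive closure of dependents and replaying everything in the load shadow can be sketched as follows. The helper names, the dependence graph, and the set of instructions in the shadow are hypothetical, chosen only to illustrate the two policies.

```python
def dependents_closure(children, root):
    """All instructions reachable from root along dependence edges."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def replay_sets(children, missed_load, issued_in_shadow):
    """Return (ideal selective replay set, non-selective replay set)."""
    selective = dependents_closure(children, missed_load) & set(issued_in_shadow)
    non_selective = set(issued_in_shadow)
    return selective, non_selective


# Hypothetical example: add and sub depend (transitively) on the load,
# while mul and or are independent but were issued in the load shadow.
children = {"ld": ["add"], "add": ["sub"], "mul": ["or"]}
shadow = {"add", "sub", "mul", "or"}
selective, non_selective = replay_sets(children, "ld", shadow)
print(sorted(selective), sorted(non_selective))
```

Here selective replay touches only add and sub, whereas non-selective replay also reissues the independent mul and or, which is the source of its performance penalty.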

23
Position-based selective replay

Ideal selective recovery
replay dependent instructions only
Dependence tracking is managed in matrix form
Column: load issue slot; row: pipeline stage
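
One way to picture such a matrix is as one bit-vector per load issue slot, shifted by one position every cycle. The sketch below is a simplified illustration loosely modeled on the description above, not the exact hardware of any published design; the constants `NUM_LOAD_SLOTS` and `NUM_STAGES` and all function names are assumptions.

```python
NUM_LOAD_SLOTS = 2   # assumed number of load issue slots (matrix columns)
NUM_STAGES = 3       # assumed depth of the load resolution loop (matrix rows)


def empty_matrix():
    # One bit-vector per column; bit k set = depends on a load issued k cycles ago.
    return [0] * NUM_LOAD_SLOTS


def load_matrix(slot):
    matrix = empty_matrix()
    matrix[slot] = 1
    return matrix


def merge(*parents):
    # A dependent inherits the union of its parents' matrices.
    merged = empty_matrix()
    for parent in parents:
        merged = [a | b for a, b in zip(merged, parent)]
    return merged


def advance(matrix):
    # Every cycle each tracked load moves one stage deeper; resolved loads drop out.
    mask = (1 << NUM_STAGES) - 1
    return [(column << 1) & mask for column in matrix]


def hit_by_miss(matrix, slot):
    # A miss is reported for the load now in the last stage of the given slot.
    return bool(matrix[slot] & (1 << (NUM_STAGES - 1)))


load = advance(load_matrix(slot=0))          # cycle t+1: load issued one cycle ago
child = merge(load, empty_matrix())          # child issues and inherits the load's bits
independent = empty_matrix()
load, child, independent = (advance(m) for m in (load, child, independent))  # cycle t+2
print(hit_by_miss(child, 0), hit_by_miss(independent, 0))
```

When the miss for slot 0 is signaled, the child answers "yes" to "am I dependent on the current cache miss?" with a single bit test, and the independent instruction is left alone.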

24
Low-complexity scheduling techniques

FIFO (Palacharla, Jouppi, Smith, 1996)
Replaces conventional scheduling logic with
multiple FIFOs
Steering logic puts instructions into different
FIFOs considering dependences
A FIFO contains a chain of dependent instructions
Only the head instructions are considered for
issue
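
The steering step can be sketched with a simple heuristic: append an instruction to a FIFO whose tail produces one of its operands, otherwise start a new chain in an empty FIFO. This is a minimal sketch of that idea, not the exact heuristic from the paper; the function name `steer`, the stall behavior, and the example program are assumptions, and issue from the FIFO heads is not modeled.

```python
from collections import deque


def steer(program, num_fifos=4):
    """Place instructions into FIFOs so that each FIFO holds a dependence chain.

    program is a list of (name, parent_names) in program order.
    """
    fifos = [deque() for _ in range(num_fifos)]
    for name, parents in program:
        # Prefer a FIFO whose tail instruction produces one of our operands.
        target = next((f for f in fifos if f and f[-1] in parents), None)
        if target is None:
            # Otherwise start a new chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            raise RuntimeError("no suitable FIFO: dispatch would stall")
        target.append(name)
    return [list(f) for f in fifos]


# Hypothetical program: d depends on b and c, e depends on c.
program = [("a", []), ("b", ["a"]), ("c", []), ("d", ["b", "c"]), ("e", ["c"])]
print(steer(program, num_fifos=3))
```

Because every instruction sits behind a producer of one of its operands, only the FIFO heads need to be checked for readiness, which is what removes the broadcast-style wakeup across the whole window.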

25
FIFO (cont'd)

Scheduling example

26
FIFO (cont'd)

Performance
Comparable performance to the conventional
scheduling
Reduced scheduling logic complexity
Many related papers on clustered microarchitecture

27
CRIB Reading

Erika Gunadi and Mikko Lipasti, "CRIB: Combined Rename, Issue, and Bypass," ISCA 2011.
Goals
Match OOO performance per cycle
Match OOO frequency
Match OOO area
Reduce power significantly
Eliminate pipelines, latches, rename structures,
issue logic

28
CRIB Data Movement
[Figure: data movement among Front-End, ROB, RS, PRF, ARF, Bypass, and ALU in a physical-register-file-style design, compared with in-place execution in the CRIB]
29
In-place Execution

First proposed by Ultrascalar [1999]
Place instructions in execution stations
Route operands to instructions
Goal: massively wide issue
Power constraints not even on the horizon
CRIB: in-place execution as enabler
Eliminate pipelined execution lanes, multiported RF, renaming, wakeup/select, clock loads
Enable efficient speculation recovery
Enable variable execution latency tolerance

30
CRIB Concept
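[Figure: one CRIB entry between the Previous Entry and the Next Entry. Register columns R0 to R3 each carry a complete bit (C); Source1 and Source2 feed the ALU, whose result drives the Destination column under a write enable (WE).]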

Data values propagate combinationally (no
latches)
Completion bit propagates synchronously (latched)
Instructions stay until all are finished
When all are finished, latch data into ARF latches

31
Renaming in CRIB
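[Figure: four CRIB entries holding Add R2, R0, R0; Sub R3, R0, R2; Add R2, R2, R3; Add R2, R0, R3. Each entry shows its Source1, Source2, and Destination connections to register columns R0 to R3, the complete bits (C), and a per-entry cycle label (Cyc 1 to Cyc 3).]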

All the connections form in parallel after dispatch
Dependences are resolved by positional renaming
Each instruction issues subject to the readiness of its operands
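
Positional renaming can be illustrated by letting register values flow through the entries in order, with each entry overwriting only its destination column. The sketch models data flow only (no timing or complete bits), assumes the operand order "Op dest, src1, src2" for the four instructions on this slide, and uses made-up initial register values.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub}


def crib_execute(arf, entries):
    """Propagate register columns through CRIB entries, oldest entry first.

    arf holds the committed values of R0..R3; entries are (op, dest, src1, src2)
    with registers given as column indices. Each entry reads the values arriving
    from the previous entry, so a source always sees its nearest older writer.
    """
    regs = list(arf)
    for op, dest, src1, src2 in entries:
        result = OPS[op](regs[src1], regs[src2])
        regs = regs[:dest] + [result] + regs[dest + 1:]   # pass other columns through
    return regs


arf = [5, 7, 0, 0]               # assumed initial values of R0..R3
entries = [
    ("add", 2, 0, 0),            # Add R2, R0, R0
    ("sub", 3, 0, 2),            # Sub R3, R0, R2
    ("add", 2, 2, 3),            # Add R2, R2, R3
    ("add", 2, 0, 3),            # Add R2, R0, R3
]
print(crib_execute(arf, entries))
```

R2 is written three times, yet no rename table is needed: the third instruction reads the R2 produced by the first simply because that value is what arrives on the R2 column at its position.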

32
Scaling Up CRIB
[Figure: four CRIB partitions with ARF, LQ Bank, and SQ blocks, together with the Front End, Mult/Div unit, and Cache Port]

Multiple CRIB partitions are maintained as a circular queue
Only head ARF has committed state
Other latches are left transparent

33
Data Propagation across partitions

Transparent latches for data
Regular latches for complete bit
Data values take one additional cycle to travel
to the next partition

[Figure: register columns R0 to R3 with complete bits (C) shown across three partitions, annotated with Cycle 0 to Cycle 3]
34
CRIB Pipeline Diagram
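[Figure: pipeline diagrams. OoO: stages include Rnm, Disp, Issue, RF, WB, and Cmt, with dependence linking and data linking as separate steps. CRIB: stages include Disp, Int or A-Gen and Load, and WB, with dependence / data linking combined.]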

Fewer pipe stages
Remove rename stage from front-end
Remove issue and RF from middle
Combine dependence and data linking

35
Load-Store Ordering
[Figure: CRIB entries ADD R2, R3, R1; ADD R2, R1, R1; LD R3, R1, R2; ST R0, R1, 1 over register columns R0 to R3, with the LQ and SQ holding Addr and Data fields]

Loads/stores are ordered aggressively
Recovery: replay in place
No prediction needed: recovery is cheap

36
Branch Misprediction
[Figure: CRIB entries Instruction 0, the mispredicted branch, Instruction 2 (NOP), and Instruction 3 (NOP) over register columns R0 to R3]

Mispredicted branch drives a global signal up the
CRIB
Forces younger instructions to transform into
NOPs
Simpler than checkpointing or ROB unrolling

37
CRIB Findings

CRIB proposal appears promising
Competitive IPC and area
Dramatic power reductions
Over baseline 1 (Bobcat)
45% less energy per instruction
20-30% better IPC
Over baseline 2 (Nehalem)
75% less energy per instruction
INT IPC slightly better, FP IPC slightly worse

38
CRIB Summary

Instructions are inserted from front end
Instructions inside CRIB execute subject to
readiness of operands
Data propagates without latches
Complete bit ensures that data propagate
synchronously
A CRIB retires when all its instructions are done executing
When a CRIB retires, data are latched in the ARF

39
Memory Dataflow
40
Scalable Load/Store Queues

Load queue/store queue
Large instruction window: many loads and stores have to be buffered (25%/15% of mix)
Expensive searches
positional-associative searches in SQ,
associative lookups in LQ
coherence, speculative load scheduling
Power/area/delay are prohibitive
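
The positional-associative store queue search mentioned above can be written as a scan for the youngest store that is older than the load and matches its address. This is a software sketch of the lookup semantics only; hardware performs the comparison in parallel across all entries, which is where the cost comes from. The tuple layout and function name are assumptions, and partial-overlap cases are ignored.

```python
def forward_from_store_queue(store_queue, load_seq, load_addr):
    """Return the value forwarded to a load, or None if no older store matches.

    store_queue is a list of (seq, addr, value) in program order, where seq is
    a program-order sequence number shared by loads and stores.
    """
    for seq, addr, value in reversed(store_queue):   # youngest entry first
        if seq < load_seq and addr == load_addr:
            return value
    return None


# Hypothetical store queue contents: three stores to the same address.
store_queue = [(1, 0x10, 11), (3, 0x10, 33), (5, 0x10, 55)]
print(forward_from_store_queue(store_queue, load_seq=4, load_addr=0x10))
```

The load with sequence number 4 must receive 33: the store with sequence number 5 matches the address but is younger than the load, and the store with sequence number 1 has been overwritten by a later one. Getting this priority right is what makes the search positional rather than a plain address match.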

41
Store Queue/Load Queue Scaling

Multilevel queues
Bloom filters (quick check for independence)
Eliminate associative load queue via replay [Cain 2004]
Issue loads again at commit, in order
Check to see if same value is returned
Filter load checks for efficiency
Most loads don't issue out of order (no speculation)
Most loads don't coincide with coherence traffic
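
The replay-at-commit check and its filters can be sketched as below. The record fields, the two filter flags, and the dictionary standing in for the cache are assumptions made for illustration; they only mirror the two filtering conditions listed above.

```python
from dataclasses import dataclass


@dataclass
class CommittingLoad:
    addr: int
    value: int                   # value obtained when the load first executed
    issued_out_of_order: bool    # did the load speculate past older memory operations?
    saw_coherence_event: bool    # did coherence traffic overlap with the load?


def must_replay(load):
    # Filter: loads that took no risk need no second cache access.
    return load.issued_out_of_order or load.saw_coherence_event


def commit_load(load, memory):
    """Return True if the load may commit, False if the pipeline must squash."""
    if not must_replay(load):
        return True
    return memory.get(load.addr) == load.value   # re-execute in order and compare


memory = {0x100: 7}
in_order = CommittingLoad(0x100, 7, False, False)
speculative_ok = CommittingLoad(0x100, 7, True, False)
speculative_stale = CommittingLoad(0x100, 3, True, False)
print(commit_load(in_order, memory),
      commit_load(speculative_ok, memory),
      commit_load(speculative_stale, memory))
```

Only the last load fails the comparison and forces recovery; the first one never touches the cache a second time, which is how filtering keeps the extra bandwidth low.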

42
SVW and NoSQ

Store Vulnerability Window (SVW)
Assign sequence numbers to stores
Track writes to cache with sequence numbers
Efficiently filter out safe loads/stores by only
checking against writes in vulnerability window
NoSQ
Rely on load/store alias prediction to satisfy
dependent pairs
Use SVW technique to check
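
The SVW filtering idea can be sketched as a small table, indexed by a hash of the address, that remembers the sequence number of the last committed store to each entry. A load re-executes only if a store younger than the ones it is known to be safe from has written its entry. The class name, table size, modulo hash, and the `ssn_not_vulnerable` parameter are assumptions for this sketch, not the published design's exact structures.

```python
class StoreVulnerabilityFilter:
    """Tracks, per hashed address, the sequence number of the last committed store."""

    def __init__(self, num_entries=64):
        self.table = [0] * num_entries

    def _index(self, addr):
        return addr % len(self.table)        # assumed trivial hash

    def commit_store(self, addr, ssn):
        self.table[self._index(addr)] = ssn

    def load_must_reexecute(self, addr, ssn_not_vulnerable):
        # The load is vulnerable only to stores younger than ssn_not_vulnerable.
        return self.table[self._index(addr)] > ssn_not_vulnerable


svw = StoreVulnerabilityFilter()
svw.commit_store(addr=0x40, ssn=5)
print(svw.load_must_reexecute(0x40, ssn_not_vulnerable=3),
      svw.load_must_reexecute(0x40, ssn_not_vulnerable=7),
      svw.load_must_reexecute(0x48, ssn_not_vulnerable=3))
```

Because different addresses can share a table entry, the filter may request an unnecessary re-execution, but under these assumptions it never skips a needed one, so it stays safe while removing most checks.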

43
Store/Load Optimizations

Weakness: predictor still fails
Machine should fail gracefully, not fall off a
cliff
Glass jaw
Several other concurrent proposals
DMDC, Fire-and-forget,

44
Key Challenge: MLP

Tolerate/overlap memory latency
Once first miss is encountered, find another one
Naïve solution
Implement a very large ROB, IQ, LSQ
Power/area/delay make this infeasible
Build virtual instruction window

45
Runahead

Use poison bits to eliminate miss-dependent load
program slice
Forward load slice processing is a very old idea
Massive Memory Machine [Garcia-Molina et al. 84]
Datascalar [Burger, Kaxiras, Goodman 97]
Runahead proposed by [Dundas, Mudge 97]
Checkpoint state, keep running
When miss completes, return to checkpoint
May need runahead cache for store/load
communication
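
Poison-bit propagation can be sketched as a single pass over the instruction stream: a missing load poisons its destination register, any instruction reading a poisoned register poisons its own destination, and a valid write clears the poison. The function name, the instruction tuple format, and the example program are hypothetical.

```python
def miss_dependent_slice(program, missing_loads):
    """Return the names of instructions in the forward slice of the missing loads.

    program is a list of (name, dest_register, source_registers) in program order.
    """
    poisoned = set()
    slice_names = []
    for name, dest, sources in program:
        if name in missing_loads or any(src in poisoned for src in sources):
            poisoned.add(dest)             # result is unknown: propagate poison
            slice_names.append(name)
        else:
            poisoned.discard(dest)         # a valid write clears earlier poison
    return slice_names


# Hypothetical program: ld1 misses; ld2 is independent and can still execute.
program = [
    ("ld1", "r1", ["r0"]),
    ("add", "r2", ["r1"]),
    ("ld2", "r3", ["r4"]),
    ("mov", "r1", ["r5"]),
    ("sub", "r6", ["r1"]),
]
print(miss_dependent_slice(program, {"ld1"}))
```

Only ld1 and add are poisoned. ld2 executes during runahead and can start its own miss early, which is the memory-level parallelism the technique is after, and sub is clean because mov overwrote r1 with a valid value.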

46
Waiting Instruction Buffer

[Lebeck et al. ISCA 2002]
Capture forward load slice in separate buffer
Propagate poison bits to identify slice
Relieve pressure on issue queue
Reinsert instructions when load completes
Very similar to Intel Pentium 4 replay mechanism
But not publicly known at the time

47
Continual Flow Pipelines

[Srinivasan et al. 2004]
Slice buffer: extension of WIB
Store operands in the slice buffer as well, to free up buffer entries in the OOO window
Relieve pressure on rename/physical registers
Applicable to
data-capture machines (Intel P6) or
physical register file machines (Pentium 4)
Also extended to in-order machines (iCFP)
Challenge: how to buffer loads/stores
Reading: Hilton & Roth, BOLT, HPCA 2010

48
Instruction Flow
49
Transparent Control Independence

Control flow graph convergence
Execution reconverges after branches
If-then-else constructs, etc.
Can we fetch/execute instructions beyond
convergence point?
How do we resolve ambiguous register and memory dependences?
Writes may or may not occur in branch shadow
TCI employs CFP-like slice buffer to solve these
problems
Instructions with ambiguous dependences buffered
Reinsert them the same way the forward load-miss slice is reinserted
Best CI proposal to date, but still very
complex and expensive, with moderate payback