Title: Advanced Microarchitecture
1Advanced Microarchitecture
Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by Ilhyun Kim
Updated by Mikko Lipasti
2Outline
- Instruction scheduling overview
- Scheduling atomicity
- Speculative scheduling
- Scheduling recovery
- Complexity-effective instruction scheduling techniques
- CRIB reading
- Scalable load/store handling
- NoSQ reading
- Building large instruction windows
- Runahead, CFP, iCFP
- Control Independence
- 3D die stacking
3Readings
- Read on your own
- Shen & Lipasti, Chapter 10 on Advanced Register Data Flow (skim)
- I. Kim and M. Lipasti, "Understanding Scheduling Replay Schemes," in Proceedings of the 10th International Symposium on High-performance Computer Architecture (HPCA-10), February 2004.
- Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike Upton, "Continual Flow Pipelines," in Proceedings of ASPLOS 2004, October 2004.
- Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric Rotenberg, Haitham H. Akkary, "Transparent Control Independence," in Proceedings of ISCA-34, 2007.
- To be discussed in class
- T. Sha, M. Martin, A. Roth, "NoSQ: Store-Load Communication without a Store Queue," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.
- Erika Gunadi, Mikko Lipasti, "CRIB: Combined Rename, Issue, and Bypass," ISCA 2011.
- Andrew Hilton, Amir Roth, "BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution," Proceedings of HPCA 2010.
- Loh, G. H., Xie, Y., and Black, B. 2007. Processor Design in 3D Die-Stacking Technologies. IEEE Micro 27, 3 (May 2007), 31-48.
4Register Dataflow
5Instruction scheduling
- A process of mapping a series of instructions onto execution resources
- Decides when and where an instruction is executed
6Instruction scheduling
- A set of wakeup and select operations
- Wakeup
- Broadcasts the tags of the parent instructions selected
- Dependent instruction gets matching tags, determines if source operands are ready
- Resolves true data dependences
- Select
- Picks instructions to issue among a pool of ready instructions
- Resolves resource conflicts
- Issue bandwidth
- Limited number of functional units / memory ports
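The wakeup and select steps above can be sketched as a small cycle-by-cycle model. This is a minimal illustration, not the design of any particular processor: the `Entry` fields, the oldest-first select policy, and the assumption that every operation has single-cycle latency (so a dependent can issue the cycle after its parent) are all choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    dest: str          # tag broadcast when this instruction is selected
    srcs: set          # source tags still outstanding
    fu: str = "alu"    # functional unit class it needs (hypothetical field)


def select(ready, fu_free, issue_width):
    """Oldest-first select: grant requests until issue bandwidth or FUs run out."""
    free = dict(fu_free)
    granted = []
    for entry in ready:                      # window is kept in program order
        if len(granted) == issue_width:
            break
        if free.get(entry.fu, 0) > 0:
            free[entry.fu] -= 1
            granted.append(entry)
    return granted


def wakeup(window, tags):
    """Broadcast the selected tags; waiting entries clear matching sources."""
    for entry in window:
        entry.srcs -= tags


def schedule(window, fu_free, issue_width):
    """Return the names issued in each cycle, assuming single-cycle operations."""
    window = list(window)
    trace = []
    while window:
        ready = [e for e in window if not e.srcs]
        granted = select(ready, fu_free, issue_width)
        if not granted:
            raise RuntimeError("no instruction can issue")
        window = [e for e in window if e not in granted]
        wakeup(window, {e.dest for e in granted})
        trace.append([e.name for e in granted])
    return trace


def example_window():
    return [
        Entry("i0", "r1", set()),
        Entry("i1", "r2", {"r1"}),
        Entry("i2", "r3", set()),
        Entry("i3", "r4", {"r2", "r3"}),
    ]


print(schedule(example_window(), {"alu": 2}, issue_width=2))
print(schedule(example_window(), {"alu": 1}, issue_width=2))
```

With two ALUs the independent pair `i0`, `i2` issues together; with one ALU the select logic serializes them, which is the resource-conflict half of the problem, while the cleared source tags are the data-dependence half.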
7Scheduling loop
- Basic wakeup and select operations
[Figure: scheduling loop. Wakeup logic turns ready entries into requests 0..n; select logic returns grants 0..n; the selected instruction issues to an FU and its tag is broadcast back to the wakeup logic.]
8Wakeup and Select
9Scheduling Atomicity
- Operations in the scheduling loop must occur within a single clock cycle
- For back-to-back execution of dependent instructions
10Implication of scheduling atomicity
- Pipelining is a standard way to improve clock frequency
- Hard to pipeline instruction scheduling logic without losing ILP
- 10% IPC loss in 2-cycle scheduling
- 19% IPC loss in 3-cycle scheduling
- A major obstacle to building high-frequency microprocessors
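Why a pipelined scheduling loop costs ILP can be seen with a small timing model. The sketch below assumes unlimited issue width and that a dependent can be selected no earlier than `max(parent latency, scheduling loop length)` cycles after its parent; the function name and the example dependence graph are made up for illustration and do not reproduce the IPC figures above.

```python
def issue_cycles(deps, latency, sched_loop):
    """Earliest issue cycle of each instruction.

    deps[i] lists the (earlier) instructions that instruction i depends on.
    A dependent issues once each parent's result is available AND the
    wakeup/select loop has had time to complete.
    """
    issue = []
    for parents in deps:
        cycle = 0
        for p in parents:
            cycle = max(cycle, issue[p] + max(latency[p], sched_loop))
        issue.append(cycle)
    return issue


# A chain of three single-cycle operations plus one independent instruction.
deps = [[], [0], [1], []]
single_cycle = [1, 1, 1, 1]

print(issue_cycles(deps, single_cycle, sched_loop=1))  # atomic scheduling
print(issue_cycles(deps, single_cycle, sched_loop=2))  # 2-cycle scheduling

# A 3-cycle parent hides a 2-cycle scheduling loop entirely.
print(issue_cycles([[], [0]], [3, 1], sched_loop=2))
```

Only chains of short-latency operations are stretched: the dependent of a multi-cycle operation issues at the same time either way, which is why the loss shows up as reduced back-to-back execution rather than as a uniform slowdown.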
11Scheduler Designs
- Data-Capture Scheduler
- Keeps the most recent register value in reservation stations
- Data forwarding and wakeup are combined
12Scheduler Designs
- Non-Data-Capture Scheduler
- Keeps the most recent register value in the RF (physical registers)
- Data forwarding and wakeup are decoupled
- Complexity benefits
- Simpler scheduler / data / wakeup path
13Mapping to pipeline stages
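[Figure: mapping of the data and wakeup paths to pipeline stages. One pipeline shares a stage for data and wakeup (Data / wakeup); the Pentium 4 (non-data-capture) pipeline has separate wakeup and data stages.]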
14Scheduling atomicity and the non-data-capture scheduler
- Multi-cycle scheduling loop
- Scheduling atomicity is not maintained
- Separated by extra pipeline stages (Disp, RF)
- Unable to issue dependent instructions consecutively
- → Solution: speculative scheduling
15Speculative Scheduling
- Speculatively wake up dependent instructions even before the parent instruction starts execution
- Keep the scheduling loop within a single clock cycle
- But nobody knows what will happen in the future
- Source of uncertainty in instruction scheduling: loads
- Cache hit / miss
- Store-to-load aliasing
- → Eventually affects timing decisions
- Scheduler assumes that all types of instructions have pre-determined fixed latencies
- Load instructions are assumed to have the common-case DL1 hit latency (hits are over 90% in general)
- If incorrect, subsequent (dependent) instructions are replayed
16Speculative Scheduling
- Unlike the original Tomasulo's algorithm
- Instructions are scheduled BEFORE actual execution occurs
- Assumes instructions have pre-determined fixed latencies
- ALU operations: fixed latency
- Load operations: assume DL1 latency (common case)
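The fixed-latency assumption can be written down directly. In this sketch the latency table and function names are hypothetical (the cycle counts are placeholders, not those of a real machine): dependents are woken at a cycle fixed when the parent issues, and a mis-schedule exists whenever the actual latency turns out to be longer.

```python
# Hypothetical cycle counts; "load" stands for the assumed DL1 hit latency.
ASSUMED_LATENCY = {"alu": 1, "load": 3}


def speculative_wakeup_cycle(op, issue_cycle):
    """Cycle in which dependents are woken, decided at the parent's issue time."""
    return issue_cycle + ASSUMED_LATENCY[op]


def mis_scheduled(op, issue_cycle, actual_latency):
    """True if dependents woken at the assumed latency would see a value that is not ready."""
    return issue_cycle + actual_latency > speculative_wakeup_cycle(op, issue_cycle)


print(speculative_wakeup_cycle("load", 10))          # dependents woken for cycle 13
print(mis_scheduled("load", 10, actual_latency=3))   # DL1 hit: speculation was right
print(mis_scheduled("load", 10, actual_latency=12))  # miss: dependents must be replayed
```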
17Scheduling replay
- Speculation needs verification / recovery
- There's no free lunch
- Recovery is needed if the actual load latency is longer than what was speculated (e.g. on a cache miss)
- Best solution (disregarding complexity): replay data-dependent instructions issued under the load shadow
[Figure: timeline marking the point at which the cache miss is detected.]
18Wavefront propagation
- Speculative execution wavefront
- Speculative image of execution (from the scheduler's perspective)
- Both wavefronts propagate along dependence edges at the same rate (1 level / cycle)
- The real wavefront runs behind the speculative wavefront
- The load resolution loop delay complicates the recovery process
- A scheduling miss is reported a couple of clock cycles after issue
19Load resolution feedback delay in instruction scheduling
[Figure: scheduling pipeline (Broadcast / wakeup, Select, Dispatch / Payload, RF, Misc., Execution) with instructions N through N-4 in flight. The time delay between scheduling and feedback determines the path of in-flight instructions that must be recovered.]
- Scheduling runs multiple clock cycles ahead of execution
- But instructions can keep track of only one level of dependence at a time (using source operand identifiers)
20Issues in scheduling replay
[Figure: Sched / Issue and Exe stages over cycles n, n+1, n+2, n+3; the checker raises the cache miss signal after dependents have already been scheduled.]
- Cannot stop speculative wavefront propagation
- Both wavefronts propagate at the same rate
- Dependent instructions are unnecessarily issued
under load misses
21Requirements of scheduling replay
- Propagation of recovery status should be faster than speculative wavefront propagation
- Recovery should be performed on the transitive closure of dependent instructions
- Conditions for ideal scheduling replay
- All mis-scheduled dependent instructions are invalidated instantly
- Independent instructions are unaffected
- Multiple levels of dependence tracking are needed
- e.g. "Am I dependent on the current cache miss?"
- Longer load resolution loop delay → tracking more levels
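The "transitive closure of dependent instructions" can be computed with a simple graph walk. The sketch below is a software illustration of which instructions an ideal replay would invalidate, not a hardware mechanism; the `consumers` map and the instruction names are invented for the example.

```python
from collections import deque


def replay_set(consumers, missed_load):
    """Instructions reachable from the missed load along dependence edges.

    consumers maps a producer to the instructions that read its result.
    """
    seen = set()
    work = deque([missed_load])
    while work:
        producer = work.popleft()
        for consumer in consumers.get(producer, []):
            if consumer not in seen:
                seen.add(consumer)
                work.append(consumer)
    return seen


consumers = {
    "ld": ["add"],          # add reads the load result
    "add": ["sub", "st"],   # two levels below the load
    "mul": ["or"],          # independent chain, must not be replayed
}
print(sorted(replay_set(consumers, "ld")))
```

Each extra cycle of load resolution delay lets the speculative wavefront move one more level down this graph, which is why a longer delay means more levels to track.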
22Scheduling replay schemes
- Alpha 21264: non-selective replay
- Replays all dependent and independent instructions issued under the load shadow
- Analogous to squashing recovery in branch misprediction
- Simple but high performance penalty
- Independent instructions are unnecessarily replayed
23Position-based selective replay
- Ideal selective recovery
- replay dependent instructions only
- Dependence tracking is managed in matrix form
- Column: load issue slot; row: pipeline stage
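One possible encoding of such a matrix is sketched below, under stated assumptions: each instruction carries one bitmask per pipeline stage (row), with one bit per load issue slot (column); a dependent ORs together its parents' matrices; and every matrix shifts down one row per cycle. The constants `STAGES` and `SLOTS` and all function names are hypothetical, and the details of the scheme in the assigned reading may differ.

```python
STAGES = 3  # rows: hypothetical depth of the load resolution loop, in pipeline stages
SLOTS = 2   # columns: hypothetical number of load issue slots


def empty_matrix():
    return [0] * STAGES          # one bitmask over load issue slots per stage


def load_issued(slot):
    """Matrix for a load that just issued in the given slot (row 0)."""
    if not 0 <= slot < SLOTS:
        raise ValueError("no such load issue slot")
    matrix = empty_matrix()
    matrix[0] = 1 << slot
    return matrix


def merge(*parents):
    """A dependent inherits every load dependence of its parents."""
    matrix = empty_matrix()
    for parent in parents:
        for stage in range(STAGES):
            matrix[stage] |= parent[stage]
    return matrix


def advance(matrix):
    """One cycle later each tracked load is one stage further along."""
    return [0] + matrix[:-1]


def depends_on_miss(matrix, slot):
    """Miss reported for the load in `slot` that has reached the last stage."""
    return bool(matrix[STAGES - 1] & (1 << slot))


# Cycle 0: a load issues in slot 1.
window = {"ld": load_issued(1)}
# Cycle 1: "add" is woken by the load; "xor" is independent.
window = {name: advance(m) for name, m in window.items()}
window["add"] = merge(window["ld"])
window["xor"] = empty_matrix()
# Cycle 2: "sub" is woken by "add".
window = {name: advance(m) for name, m in window.items()}
window["sub"] = merge(window["add"])

replay = sorted(n for n, m in window.items() if n != "ld" and depends_on_miss(m, 1))
print(replay)
```

When the miss is reported, each instruction answers "am I dependent on this load?" with a single bit test, so `add` and `sub` are replayed and `xor` is left alone without walking the dependence chain.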
24Low-complexity scheduling techniques
- FIFO (Palacharla, Jouppi, Smith, 1996)
- Replaces conventional scheduling logic with multiple FIFOs
- Steering logic puts instructions into different FIFOs considering dependences
- A FIFO contains a chain of dependent instructions
- Only the head instructions are considered for issue
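A simplified version of dependence-based steering might look like the following. It is a sketch under assumptions, not the exact heuristic from the paper: an instruction follows a producer only if that producer is at the tail of a FIFO, otherwise it takes an empty FIFO, otherwise dispatch stalls. The FIFOs never drain here because issue is not modeled.

```python
from collections import deque


def steer(instrs, num_fifos):
    """Place (name, dest, srcs) tuples into FIFOs; return FIFO contents and stalled names."""
    fifos = [deque() for _ in range(num_fifos)]
    for position, (name, dest, srcs) in enumerate(instrs):
        # Prefer a FIFO whose tail produces one of this instruction's sources.
        target = next((f for f in fifos if f and f[-1][1] in srcs), None)
        if target is None:
            # Otherwise start a new chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            # No suitable FIFO: dispatch stalls here, in program order.
            stalled = [n for n, _, _ in instrs[position:]]
            return [[n for n, _ in f] for f in fifos], stalled
        target.append((name, dest))
    return [[n for n, _ in f] for f in fifos], []


program = [
    ("i0", "r1", set()),
    ("i1", "r2", {"r1"}),
    ("i2", "r3", set()),
    ("i3", "r4", {"r2"}),
    ("i4", "r5", {"r3"}),
]
print(steer(program, num_fifos=2))
print(steer(program, num_fifos=1))
```

With two FIFOs the two dependence chains separate cleanly, so only the heads `i0` and `i2` need to be examined by the issue logic; with one FIFO the second chain has nowhere to go and dispatch stalls.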
25FIFO (cont'd)
26FIFO (cont'd)
- Performance
- Comparable performance to conventional scheduling
- Reduced scheduling logic complexity
- Many related papers on clustered microarchitecture
27CRIB Reading
- Erika Gunadi, Mikko Lipasti, "CRIB: Combined Rename, Issue, and Bypass," ISCA 2011.
- Goals
- Match OOO performance per cycle
- Match OOO frequency
- Match OOO area
- Reduce power significantly
- Eliminate pipelines, latches, rename structures, issue logic
28CRIB Data Movement
[Figure: data movement in a physical-register-file-style OoO core (Front-End, ROB, RS, PRF, ARF, Bypass, ALU) versus CRIB with in-place execution.]
29In-place Execution
- First proposed by Ultrascalar (1999)
- Place instructions in execution stations
- Route operands to instructions
- Goal: massively wide issue
- Power constraints not even on the horizon
- CRIB: in-place execution as enabler
- Eliminate pipelined execution lanes, multiported RF, renaming, wakeup/select, clock loads
- Enable efficient speculation recovery
- Enable variable execution latency tolerance
30CRIB Concept
[Figure: one CRIB entry (Source1, Source2, Destination, ALU, WE) between the previous and next entries, with register columns R0-R3 and completion bits (C).]
- Data values propagate combinationally (no latches)
- Completion bit propagates synchronously (latched)
- Instructions stay until all are finished
- When all are finished, latch data into ARF latches
31Renaming in CRIB
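[Figure: four CRIB entries (Source1, Source2, Destination) over register columns R0-R3 with completion bits (C), holding the sequence below; entries are annotated with execution cycles (Cyc 1 to Cyc 3).]
Add R2, R0, R0
Sub R3, R0, R2
Add R2, R2, R3
Add R2, R0, R3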
- All connections form in parallel after dispatch
- Dependences are resolved by positional renaming
- Instructions issue subject to the readiness of their operands
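Positional renaming can be illustrated in software: each source is connected to the nearest older entry that writes the same register, or to the ARF if there is none. The sketch below uses the four-instruction sequence from this slide; the tuple layout and the function name `link_sources` are assumptions made for the example, and in hardware all links form at once rather than in a loop.

```python
def link_sources(program):
    """For each entry, report where each source operand comes from.

    program is a list of (opcode, dest, (src1, src2)) in program order.
    A source is linked to the index of the nearest older writer, or "ARF".
    """
    last_writer = {}
    links = []
    for index, (_, dest, srcs) in enumerate(program):
        links.append(tuple(last_writer.get(src, "ARF") for src in srcs))
        last_writer[dest] = index     # younger entries now see this value
    return links


program = [
    ("Add", "R2", ("R0", "R0")),
    ("Sub", "R3", ("R0", "R2")),
    ("Add", "R2", ("R2", "R3")),
    ("Add", "R2", ("R0", "R3")),
]
for (op, dest, srcs), link in zip(program, link_sources(program)):
    print(op, dest, srcs, "<-", link)
```

No rename table is consulted: an entry's position in the structure alone determines which value of `R2` or `R3` it sees, which is what lets CRIB drop the rename stage.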
32Scaling Up CRIB
[Figure: front end, four ARFs, LQ banks, SQs, a Mult/Div unit, and a cache port arranged around the CRIB partitions.]
- Multiple CRIB partitions maintained as a circular queue
- Only the head ARF has committed state
- Other latches are left transparent
33Data Propagation across partitions
- Transparent latches for data
- Regular latches for complete bit
- Data values take one additional cycle to travel
to the next partition
[Figure: register columns R0-R3 with completion bits (C) across partitions, shown for Cycle 0 through Cycle 3.]
34CRIB Pipeline Diagram
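[Figure: pipeline diagrams for OoO (stages include Rnm, Disp, Issue, RF, WB, Cmt; dependence linking and data linking are separate) and CRIB (stages include Disp, WB; dependence / data linking combined), with integer (Int) and load (A-Gen, Load) paths.]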
- Fewer pipe stages
- Remove rename stage from front-end
- Remove issue and RF from middle
- Combine dependence and data linking
35Load-Store Ordering
[Figure: CRIB entries alongside the LQ and SQ (Addr, Data) over register columns R0-R3, holding the instructions ADD R2, R3, R1; ADD R2, R1, R1; LD R3, R1, R2; ST R0, R1, 1.]
- Loads/stores are ordered aggressively
- Recovery: replay in place
- No prediction needed: recovery is cheap
36Branch Misprediction
[Figure: CRIB entries over register columns R0-R3; Instruction 0 is a mispredicted branch, and the younger Instruction 2 and Instruction 3 become NOPs.]
- Mispredicted branch drives a global signal up the CRIB
- Forces younger instructions to transform into NOPs
- Simpler than checkpointing or ROB unrolling
37CRIB Findings
- CRIB proposal appears promising
- Competitive IPC and area
- Dramatic power reductions
- Over baseline 1 (Bobcat)
- 45% less energy per instruction
- 20-30% better IPC
- Over baseline 2 (Nehalem)
- 75% less energy per instruction
- INT IPC slightly better, FP IPC slightly worse
38CRIB Summary
- Instructions are inserted from front end
- Instructions inside CRIB execute subject to readiness of operands
- Data propagates without latches
- Complete bit ensures that data propagate synchronously
- A CRIB retires when all instructions are done executing
- When a CRIB retires, data are latched in the ARF
39Memory Dataflow
40Scalable Load/Store Queues
- Load queue/store queue
- Large instruction window: many loads and stores have to be buffered (25%/15% of mix)
- Expensive searches
- positional-associative searches in SQ,
- associative lookups in LQ
- coherence, speculative load scheduling
- Power/area/delay are prohibitive
41Store Queue/Load Queue Scaling
- Multilevel queues
- Bloom filters (quick check for independence)
- Eliminate associative load queue via replay
- Cain 2004
- Issue loads again at commit, in order
- Check to see if same value is returned
- Filter load checks for efficiency
- Most loads don't issue out of order (no speculation)
- Most loads don't coincide with coherence traffic
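The replay-at-commit check with filtering can be sketched as follows. This is a simplified software model, not the mechanism as published: the dictionary keys `issued_before_older_store` and `saw_external_invalidation` are hypothetical flags standing in for whatever the hardware records, and `memory` is a plain dictionary standing in for the cache.

```python
def needs_check(load):
    """Filter: only loads that could have seen a stale value are replayed."""
    return load["issued_before_older_store"] or load["saw_external_invalidation"]


def commit_load(load, memory):
    """Re-issue the load in order at commit and compare against the early value."""
    if not needs_check(load):
        return "commit"
    current = memory.get(load["addr"], 0)
    return "commit" if current == load["value"] else "squash"


memory = {0x40: 7}
in_order = {"addr": 0x40, "value": 7,
            "issued_before_older_store": False, "saw_external_invalidation": False}
stale = {"addr": 0x40, "value": 3,
         "issued_before_older_store": True, "saw_external_invalidation": False}
silent = {"addr": 0x40, "value": 7,
          "issued_before_older_store": True, "saw_external_invalidation": False}

print(commit_load(in_order, memory))  # filtered, never re-executed
print(commit_load(stale, memory))     # value changed underneath the load
print(commit_load(silent, memory))    # speculative, but the value still matches
```

Because the check compares values rather than addresses, a load that was reordered with a store writing the same value still commits, and no associative search of a load queue is needed.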
42SVW and NoSQ
- Store Vulnerability Window (SVW)
- Assign sequence numbers to stores
- Track writes to cache with sequence numbers
- Efficiently filter out safe loads/stores by only checking against writes in the vulnerability window
- NoSQ
- Rely on load/store alias prediction to satisfy dependent pairs
- Use SVW technique to check
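The SVW idea of tracking writes with store sequence numbers can be sketched with a small hashed table. This is an illustrative model under assumptions, not the published design: the class name, table size, and modulo hash are invented, and the table simply remembers the sequence number of the last committed store that mapped to each index.

```python
class SVWFilter:
    """Remembers, per hashed address, the sequence number of the last committed store."""

    def __init__(self, size=64):
        self.size = size
        self.table = [0] * size
        self.last_committed = 0          # sequence number of the youngest committed store

    def _index(self, addr):
        return addr % self.size          # hypothetical hash; aliasing is conservative

    def commit_store(self, addr):
        self.last_committed += 1
        self.table[self._index(addr)] = self.last_committed

    def window_start(self):
        """Recorded by a load when it executes: stores up to here are already visible."""
        return self.last_committed

    def must_reexecute(self, addr, window_start):
        """At load commit: did any store inside the vulnerability window write here?"""
        return self.table[self._index(addr)] > window_start


svw = SVWFilter()
svw.commit_store(0x100)
start = svw.window_start()               # a load to 0x100 executes now
print(svw.must_reexecute(0x100, start))  # no store committed since: safe
svw.commit_store(0x100)                  # an older store commits after the load executed
print(svw.must_reexecute(0x100, start))  # inside the window: re-execute the load
print(svw.must_reexecute(0x104, start))  # different table index: safe
```

A hash collision (for example `0x140` with a 64-entry table) only causes an unnecessary re-execution, never a missed one, which is why a small untagged table is enough.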
43Store/Load Optimizations
- Weakness: predictor still fails
- Machine should fail gracefully, not fall off a cliff
- Glass jaw
- Several other concurrent proposals
- DMDC, Fire-and-forget,
44Key Challenge: MLP
- Tolerate/overlap memory latency
- Once first miss is encountered, find another one
- Naïve solution
- Implement a very large ROB, IQ, LSQ
- Power/area/delay make this infeasible
- Build virtual instruction window
45Runahead
- Use poison bits to eliminate miss-dependent load program slice
- Forward load slice processing is a very old idea
- Massive Memory Machine [Garcia-Molina et al. 84]
- Datascalar [Burger, Kaxiras, Goodman 97]
- Runahead proposed by [Dundas, Mudge 97]
- Checkpoint state, keep running
- When miss completes, return to checkpoint
- May need runahead cache for store/load communication
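Poison-bit propagation can be sketched as a single pass over the instruction stream. The model below tracks poison on registers only (no memory, no runahead cache), and the tuple layout and names are assumptions for the example: a missing load poisons its destination, any instruction reading a poisoned register joins the slice and poisons its own destination, and a write by an unpoisoned instruction clears the bit.

```python
def runahead_slice(instrs, missing_loads):
    """Return (miss-dependent slice, registers still poisoned at the end).

    instrs is a list of (name, dest, srcs) in program order.
    """
    poisoned = set()
    miss_slice = []
    for name, dest, srcs in instrs:
        if name in missing_loads or any(src in poisoned for src in srcs):
            poisoned.add(dest)           # result is bogus; skip real execution
            miss_slice.append(name)
        else:
            poisoned.discard(dest)       # a valid write overwrites stale poison
    return miss_slice, poisoned


program = [
    ("ld1", "r1", {"r9"}),   # misses in the cache
    ("add", "r2", {"r1"}),   # depends on the miss
    ("mul", "r3", {"r4"}),   # independent
    ("ld2", "r5", {"r3"}),   # independent load: can run ahead and prefetch
    ("sub", "r1", {"r3"}),   # overwrites r1 with a valid value
    ("or",  "r6", {"r1"}),   # reads the new, valid r1
]
print(runahead_slice(program, {"ld1"}))
```

Everything outside the slice keeps executing during the miss, so an independent load such as `ld2` can start its own miss early, which is the memory-level parallelism runahead is after.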
46Waiting Instruction Buffer
- Lebeck et al. ISCA 2002
- Capture forward load slice in separate buffer
- Propagate poison bits to identify slice
- Relieve pressure on issue queue
- Reinsert instructions when load completes
- Very similar to Intel Pentium 4 replay mechanism
- But not publicly known at the time
47Continual Flow Pipelines
- Srinivasan et al. 2004
- Slice buffer: extension of WIB
- Store operands in slice buffer as well to free up buffer entries in the OOO window
- Relieve pressure on rename/physical registers
- Applicable to
- data-capture machines (Intel P6) or
- physical register file machines (Pentium 4)
- Also extended to in-order machines (iCFP)
- Challenge: how to buffer loads/stores
- Reading: Hilton & Roth, BOLT, HPCA 2010
48Instruction Flow
49Transparent Control Independence
- Control flow graph convergence
- Execution reconverges after branches
- If-then-else constructs, etc.
- Can we fetch/execute instructions beyond the convergence point?
- How do we resolve ambiguous register and memory dependences?
- Writes may or may not occur in branch shadow
- TCI employs a CFP-like slice buffer to solve these problems
- Instructions with ambiguous dependences are buffered
- Reinsert them the same way a forward load miss slice is reinserted
- Best CI proposal to date, but still very complex and expensive, with moderate payback
50Summary of Advanced Microarchitecture
- Instruction scheduling overview
- Scheduling atomicity
- Speculative scheduling
- Scheduling recovery
- Complexity-effective instruction scheduling techniques
- CRIB reading
- Scalable load/store handling
- NoSQ reading
- Building large instruction windows
- Runahead, CFP, iCFP
- Control Independence