Title: Dynamically Collapsing Dependencies for IPC and Frequency Gain
1. Dynamically Collapsing Dependencies for IPC and Frequency Gain
- Peter G. Sassone
- D. Scott Wills
- Georgia Tech
- Electrical and Computer Engineering
- sassone, scott.wills _at_ ece.gatech.edu
2. Motivation
[Figure: processor pipeline (fetch, decode, rename, issue, exec, commit) alongside the memory hierarchy (I-cache, D-cache, L2 cache, memory)]
- Outside of the pipeline, global communication dominates
  - the memory wall is well studied
- Inside the pipeline, delay was traditionally computation- or logic-dominated
3. Motivation
[Figure: out-of-order core: issue logic, issue queue, register file, and ALUs]
- Now dominated by local communication paths:
  - issue window
  - reorder buffer
  - register file
  - bypass network
- Bottlenecks both IPC and frequency
4. Motivation
[Figure: out-of-order core: issue logic, issue queue, register file, and ALUs]
- RISC instruction sets create superfluous traffic
  - all instructions and operands are treated as equal
  - little focus on exposing sequentiality
5. Contributions
- Dynamic Strands
  - collapse dependence chains without fan-out
  - exploit their properties for simple value precomputation
  - increase efficiency of critical resources
  - preserve binary compatibility
- IPC improvements
  - 17-20% speedup on SPEC2000int and MediaBench
- Frequency improvements
  - 37% fewer in-flight instructions
  - reduced dependence on dependencies
6. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
7. Dyadic Dilemma
- Performing any operation on more than two sources requires temporary values

int sum( int a, int b, int c, int d ) {
    return a + b + c + d;
}

. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .
8. Transient Operands
- We term these temporary values transient operands
  - values produced by an ALU inst
  - values consumed only once, and only by an ALU inst
- Common in modern integer workloads
On average, about 40% of all dynamic operands are transient.
9. Strands
- Strands
  - linear chains of instructions joined by transient operands
  - non-consecutive
  - span basic blocks
  - three instructions
  - only the final output needs to be committed
- Strands are common (a source-level example follows below)
  - dyadic temporaries
  - compiler strategies
  - language semantics
[Figure: example strand, a linear chain of instructions (a, b, c, d) joined by transient operands]
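Not from the original slides, but as a hedged source-level illustration of where strands come from: a small expression whose intermediate values are each read exactly once compiles to exactly such a chain of dependent ALU instructions.

```c
/* Hypothetical example (not the authors'): evaluating a small polynomial
 * compiles to a chain of dependent ALU instructions (mul, add, mul, add).
 * Each intermediate result is produced by an ALU instruction and read
 * exactly once by the next ALU instruction, so the intermediates are
 * transient operands and the chain forms a strand.  Only the final value
 * must be committed. */
int poly(int x, int a, int b, int c)
{
    return (a * x + b) * x + c;
}
```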
10. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
11. Hardware Overview
- All new strand hardware sits off the critical path
[Figure: pipeline (fetch, decode, rename, issue queue, register file, ALUs, commit) with the strand hardware attached off the critical path]
12. Algorithm Example
[Figure: numbered step-by-step walkthrough of the algorithm over the pipeline (fetch, decode, rename, issue queue, register file, ALUs, commit)]
13. Strand Cache Fill Unit
[Figure: operand table indexed by architectural register (R4, R5, R6, ...); each entry holds the last producer instruction, the last consumer instruction, and a consumer count, shown with example entries at PCs 1404-1416]
- Based around the operand table (a rough C sketch follows below)
- Detects the conditions for transient operands
- When one is found:
  - append to an existing strand, or
  - begin a new strand
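A minimal C sketch of the bookkeeping the fill unit could perform, assuming a simple table indexed by architectural register. The field names, the record_strand hook, and detecting transience at the next overwrite are assumptions for illustration, not the authors' exact hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ARCH_REGS 32

/* One operand-table entry per architectural register (fields are guesses
 * based on the columns shown on the slide). */
typedef struct {
    uint64_t last_producer_pc;  /* last instruction to write this register     */
    uint64_t last_consumer_pc;  /* last instruction to read this register      */
    int      consumer_count;    /* reads seen since the last write             */
    bool     producer_is_alu;   /* transients must be produced by an ALU inst  */
    bool     consumer_is_alu;   /* ...and consumed only by an ALU inst         */
} operand_entry;

static operand_entry table[NUM_ARCH_REGS];

/* Hypothetical hook: link producer and consumer into a strand, either
 * appending to an existing strand or beginning a new one. */
static void record_strand(uint64_t producer_pc, uint64_t consumer_pc)
{
    (void)producer_pc;
    (void)consumer_pc;
}

/* Called for every instruction the fill unit observes (off the critical path). */
void observe(uint64_t pc, bool is_alu, int dst, const int *srcs, int nsrcs)
{
    for (int i = 0; i < nsrcs; i++) {          /* note all reads of the sources */
        operand_entry *e = &table[srcs[i]];
        e->last_consumer_pc = pc;
        e->consumer_is_alu  = is_alu;
        e->consumer_count++;
    }
    if (dst >= 0) {                            /* this write kills the old value */
        operand_entry *e = &table[dst];
        /* The overwritten value was a transient operand if an ALU instruction
         * produced it and exactly one ALU instruction consumed it. */
        if (e->producer_is_alu && e->consumer_count == 1 && e->consumer_is_alu)
            record_strand(e->last_producer_pc, e->last_consumer_pc);
        e->last_producer_pc = pc;
        e->producer_is_alu  = is_alu;
        e->consumer_count   = 0;
    }
}
```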
14. Strand Cache
About 175 bytes per line, though very few lines are needed for effect (a rough C model of one line follows below).
[Figure: strand cache; each line holds status bits, previous-reader info, and up to three component instructions (i1, i2, i3), along with a PC, a ready flag, and the precomputed value]
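A rough C model of one strand-cache line, assuming the fields named in the figure above. The widths, the encodings, and the placement of the previous value used by the anti-dependence safety net are our reading of the slides, not a verified layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define STRAND_MAX_INSTS 3   /* the slide shows three instruction slots per line */

/* One strand-cache line (~175 bytes in the real design; this model only
 * shows which pieces of state live together, not the actual encoding). */
typedef struct {
    uint8_t  status;                    /* status bits: valid, length, etc.       */
    uint64_t prev_reader_pc;            /* previous-reader info, assumed here to  */
    int64_t  prev_value;                /*   back the anti-dependence safety net  */
    uint64_t insts[STRAND_MAX_INSTS];   /* the collapsed component instructions   */
    uint64_t pc;                        /* PC the dispatch engine matches on      */
    bool     ready;                     /* live inputs available?                 */
    int64_t  value;                     /* precomputed result, if ready           */
} strand_cache_line;
```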
15. Dispatch Engine
[Figure: the dispatch engine sits between decode and rename; it takes pre-renamed instructions, consults the strand cache and the dirty table, and emits strands, recovery strands, and kill signals]
- Watches for strand cache matches
- Inserts ready strands into the stream eagerly
- Removes component instructions when seen
- Correctness checking with the dirty table (decision logic sketched below)
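A minimal sketch of the per-instruction decision the dispatch engine makes, under the assumptions above. All of the helper functions (strand-cache lookup, dirty-table check, strand and recovery-strand insertion, suppression of covered instructions) are hypothetical stand-ins for the hardware structures in the figure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queries/actions standing in for the hardware structures. */
static bool strand_starts_at(uint64_t pc)         { (void)pc; return false; } /* strand-cache tag match    */
static bool strand_inputs_dirty(uint64_t pc)      { (void)pc; return false; } /* dirty-table check         */
static bool covered_by_active_strand(uint64_t pc) { (void)pc; return false; } /* component already handled */
static void insert_strand(uint64_t pc)            { (void)pc; }               /* eager strand insertion    */
static void insert_recovery_strand(uint64_t pc)   { (void)pc; }               /* recompute a dirty value   */

typedef enum { PASS, REPLACED, SUPPRESSED } dispatch_action;

/* One decision per pre-renamed instruction arriving from decode. */
dispatch_action dispatch(uint64_t pc)
{
    if (strand_starts_at(pc)) {
        if (strand_inputs_dirty(pc))
            insert_recovery_strand(pc);   /* dirty read: recover the value (see the Dirty Read slide) */
        else
            insert_strand(pc);            /* eagerly inject the ready, collapsed strand */
        return REPLACED;
    }
    if (covered_by_active_strand(pc))
        return SUPPRESSED;                /* component instruction removed from the stream */
    return PASS;                          /* ordinary instruction: forward to rename unchanged */
}
```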
16. Closed-Loop ALUs
[Figure: ALU with a free local bypass loop (½ cycle) and a mode switch onto the full bypass network (½ cycle)]
- Full bypass is half of the execute-stage delay
- Regular ALUs with a double-speed closed-loop mode (behavioral sketch below):
  - two dependent ALU operations in a single cycle
  - intermediate values (the transients) are discarded!
  - the final result still takes ½ cycle for full bypass
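A behavioral sketch, not circuitry, of one closed-loop cycle, assuming a simple two-operation strand; the operation set and operand ordering are illustrative.

```c
#include <stdint.h>

typedef enum { ADD, SUB, AND, OR, XOR } alu_op;

static int64_t alu(alu_op op, int64_t a, int64_t b)
{
    switch (op) {
        case ADD: return a + b;
        case SUB: return a - b;
        case AND: return a & b;
        case OR:  return a | b;
        case XOR: return a ^ b;
    }
    return 0;
}

/* Two dependent ALU operations in a single cycle: the intermediate (the
 * transient) stays on the free local bypass and is discarded, never touching
 * the register file or the full bypass network; only the final result pays
 * the remaining half-cycle of full bypass. */
int64_t closed_loop_cycle(alu_op op1, int64_t a, int64_t b,
                          alu_op op2, int64_t c)
{
    int64_t transient = alu(op1, a, b);   /* first half-cycle, local bypass only       */
    return alu(op2, transient, c);        /* second half-cycle, result to full bypass  */
}
```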
17. Oops: Dirty Read
[Figure: a load writes R1, so R1 is dirty; a recovery sub-strand is inserted to recover R1]
18. Oops: Anti-Dependence Violation
[Figure: R9 has already been replaced, but renaming is not sufficient outside the reorder buffer; as a safety net, a load immediate of the previous value of R9 is inserted]
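A heavily hedged sketch of this safety net: we assume the previous value of the strand's destination register is captured alongside the previous-reader info, and that the dispatch engine can inject a load-immediate micro-op. Only the "insert a load immediate of the previous value" step comes from the slide; the names and structure here are guesses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state captured when a strand eagerly replaces its
 * destination register (R9 in the slide's example). */
typedef struct {
    int     dest_reg;     /* architectural register the strand wrote          */
    int64_t prev_value;   /* value that register held before the replacement  */
} eager_writeback;

/* Hypothetical micro-op emitter inside the dispatch engine. */
static void emit_load_immediate(int reg, int64_t value)
{
    (void)reg;
    (void)value;
}

/* Renaming alone cannot repair this case outside the reorder buffer, so on a
 * detected anti-dependence violation the previous value is re-materialized
 * with a load immediate. */
void anti_dependence_safety_net(const eager_writeback *w, bool violation_detected)
{
    if (violation_detected)
        emit_load_immediate(w->dest_reg, w->prev_value);
}
```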
19. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
20. Instruction Coverage
Average ALU instruction coverage: roughly 12% with a 16-entry strand cache, rising to 27% with 1024 entries.
High coverage rates, but only with a big strand cache.
Less than a 15% replacement rate, regardless of cache size.
[Chart: coverage with various strand cache sizes]
21. IPC Improvements
Average IPC speedup: 17% (4-wide), 20% (8-wide).
Some benchmarks almost double in IPC.
Some see almost no speedup at all.
[Chart: 4-wide IPC speedup with a 16-entry strand cache]
22. Resource Occupancy
- CISCification of instructions reduces traffic
  - reorder buffer occupancy is reduced by up to 37%
  - issue queue occupancy is reduced by up to 34%
  - traffic reduction scales with coverage
- Reduced dependence on dependencies
  - opportunity for pipelined bypass
  - opportunity for pipelined issue
23. Resource Occupancy
- Caveat emptor
  - more worst-case issue CAMs
  - more worst-case register ports
- Prior work is applicable
  - only 1.2 live inputs per strand
24. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
25. Conclusion
- Key points
  - eagerly executing macro-instructions → value precomputation
  - limiting focus to transient operands
  - all new hardware is off the critical path
- Results
  - IPC speedup of 18-20% with a 3KB strand cache
  - potential for frequency gains
  - full binary compatibility
- Lots of current and future research
  - relaxing the ALU-instruction constraint
  - quantifying frequency improvements
  - static detection of strands
Questions?
26. Backup Slides
27. Sensitivity to Dispatch Delay
On average, speedup drops by only 1% with three cycles of dispatch delay.
Some benchmarks actually get faster due to fewer errant strands.
Most benchmarks lose a small amount of speedup.
[Chart: 4-wide IPC speedup with a 16-entry strand cache]