Dynamically Collapsing Dependencies for IPC and Frequency Gain

Transcript and Presenter's Notes



1
Dynamically Collapsing Dependencies for IPC and
Frequency Gain
  • Peter G. Sassone
  • D. Scott Wills
  • Georgia Tech
  • Electrical and Computer Engineering
  • {sassone, scott.wills}@ece.gatech.edu

2
Motivation
[Pipeline diagram: fetch, decode, rename, issue, exec, commit, with I-cache, D-cache, L2 cache, and memory]
  • Outside the pipeline, global communication dominates
  • The memory wall is well studied
  • Inside the pipeline, delay was traditionally dominated by computation and logic

3
Motivation
[Diagram: issue logic, issue queue, register file, and three ALUs linked by local communication paths]
  • Now dominated by local communication paths
  • issue window
  • reorder buffer
  • register file
  • bypass network
  • Bottlenecks both IPC and frequency

4
Motivation
[Diagram: issue logic, issue queue, register file, and ALUs, as on the previous slide]
  • RISC instruction sets create superfluous traffic
  • All instructions and operands are treated as
    equal
  • Little focus on exposing sequentiality

5
Contributions
  • Dynamic Strands
  • collapse dependence-chains without fan-out
  • exploit properties for simple value
    precomputation
  • increase efficiency of critical resources
  • preserve binary compatibility
  • IPC improvements
  • 17-20% speedup on SPEC2000int and MediaBench
  • Frequency improvements
  • 37% fewer in-flight instructions
  • reduced dependence on dependencies

6
Outline
  • Motivation
  • Transient Operands and Strands
  • Instruction Replacement Hardware
  • Results
  • Conclusion

7
Dyadic Dilemma
  • Performing any operation on more than two sources
    requires temporary values

int sum( int a, int b, int c, int d )
{
    return a + b + c + d;
}

. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .
8
Transient Operands
  • We term these temporary values transient
    operands
  • values produced by an ALU inst
  • values consumed only once, and only by an ALU
    inst
  • Common in modern integer workloads

On average, about 40% of all dynamic operands are transient
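
To make this definition concrete, here is a small offline sketch in Python (the Inst record, register names, and ALU_OPS set are illustrative assumptions, not the authors' trace format) that measures what fraction of dynamically produced values are transient:

    from collections import namedtuple

    # Illustrative dynamic-trace record: opcode, destination register, source registers.
    Inst = namedtuple("Inst", ["opcode", "dest", "srcs"])
    ALU_OPS = {"add", "sub", "and", "or", "xor", "shl", "shr"}

    def transient_fraction(trace):
        """Fraction of produced values that an ALU inst creates and exactly one
        ALU inst consumes."""
        producers, consumers, live = {}, {}, {}   # live: arch reg -> current value id
        next_id = 0
        for inst in trace:
            for reg in inst.srcs:
                if reg in live:
                    consumers.setdefault(live[reg], []).append(inst)
            if inst.dest is not None:
                live[inst.dest] = next_id
                producers[next_id] = inst
                next_id += 1
        transient = sum(
            1 for vid, prod in producers.items()
            if prod.opcode in ALU_OPS
            and len(consumers.get(vid, [])) == 1
            and consumers[vid][0].opcode in ALU_OPS)
        return transient / max(next_id, 1)

    # The add chain from the dyadic example: both intermediate R1 values are transient.
    trace = [Inst("add", "R1", ["R1", "R2"]),
             Inst("add", "R1", ["R1", "R3"]),
             Inst("add", "R9", ["R1", "R4"])]
    print(transient_fraction(trace))   # 2 of 3 produced values -> 0.666...
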
9
Strands
  • Strands
  • linear chains of instructions joined by transient
    operands
  • non-consecutive
  • span basic blocks
  • three instructions
  • only the final output needs to be committed
  • Strands are common
  • dyadic temporaries
  • compiler strategies
  • language semantics
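
As one way to picture the grouping, here is a rough software sketch, assuming a precomputed map from each transient value's producer to its single ALU consumer (the transient_links structure and the example PCs are hypothetical; the three-instruction limit follows this slide):

    def build_strands(transient_links, max_len=3):
        """Group instructions linked by transient operands into linear strands.
        transient_links maps a producer (here just a PC-labelled string) to the
        single ALU consumer of its transient result; chains longer than max_len
        are truncated in this greedy sketch."""
        chain_heads = set(transient_links) - set(transient_links.values())
        strands = []
        for head in chain_heads:
            strand, node = [head], head
            while node in transient_links and len(strand) < max_len:
                node = transient_links[node]
                strand.append(node)
            if len(strand) > 1:            # a strand needs at least one transient link
                strands.append(strand)
        return strands

    # The dyadic-dilemma adds collapse into one three-instruction strand.
    links = {"1404: add R1,R1,R2": "1408: add R1,R1,R3",
             "1408: add R1,R1,R3": "1412: add R9,R1,R4"}
    print(build_strands(links))
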

[Dataflow diagram: a chain of operations over inputs a, b, c, and d collapsed into a strand]
10
Outline
  • Motivation
  • Transient Operands and Strands
  • Instruction Replacement Hardware
  • Results
  • Conclusion

11
Hardware Overview
[Pipeline diagram: fetch, decode, rename, issue queue, reg file, three ALUs, commit; the added strand hardware sits off the critical path]
12
Algorithm Example
[Animated pipeline diagram: a worked example of strand detection, caching, and replacement stepping through fetch, decode, rename, issue queue, reg file, ALUs, and commit]
13
Strand Cache Fill Unit
[Operand table diagram: one row per architectural register (R4, R5, R6) tracking the last producer instruction, last consumer instruction, and consumer count, shown against example instructions at PCs 1404-1416]
  • Based around the operand table
  • Detects conditions of transients
  • When found
  • append to existing strand
  • begin new strand
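
A simplified software model of this bookkeeping follows; the field names, the assumed instruction fields (pc, srcs, dest, is_alu), and the choice to resolve transience when the register is overwritten are assumptions, not necessarily the hardware's exact behavior:

    class OperandTableEntry:
        """Per-architectural-register state tracked by the fill unit (simplified)."""
        def __init__(self):
            self.producer_pc = None        # PC of the last instruction to write the register
            self.producer_is_alu = False
            self.consumer_pc = None        # PC of the last reader since that write
            self.consumer_is_alu = False
            self.consumer_count = 0        # reads since that write

    def fill_unit_step(table, inst, on_transient):
        """Update the operand table for one committed instruction.  When the value
        being overwritten turns out to have been transient, on_transient(producer_pc,
        consumer_pc) is called so the caller can append to or begin a strand."""
        for reg in inst.srcs:
            entry = table.setdefault(reg, OperandTableEntry())
            entry.consumer_pc, entry.consumer_is_alu = inst.pc, inst.is_alu
            entry.consumer_count += 1
        if inst.dest is not None:
            entry = table.setdefault(inst.dest, OperandTableEntry())
            if entry.producer_is_alu and entry.consumer_count == 1 and entry.consumer_is_alu:
                on_transient(entry.producer_pc, entry.consumer_pc)
            entry.producer_pc, entry.producer_is_alu = inst.pc, inst.is_alu
            entry.consumer_pc, entry.consumer_is_alu, entry.consumer_count = None, False, 0
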

14
Strand Cache
About 175 bytes per line, though very few lines
are needed for effect
[Strand cache line diagram: per-strand status bits, previous-reader info (pc, ready, value), and component instructions i1, i2, i3]
15
Dispatch Engine
[Block diagram: the dispatch engine sits between decode and rename, consulting the strand cache and a dirty table; it injects strands, recovery strands, and kill signals into the pre-renamed instruction stream]
  • Watches for strand cache matches
  • Inserts ready strands into the stream eagerly
  • Removes component instructions when seen
  • Correctness checking with dirty table
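
A sketch of one dispatch decision under an assumed dirty-table semantics (dirty = the register's value was discarded as a transient inside an already-dispatched strand); the Strand fields, the suppressed_pcs set, and the omission of kill signals are simplifications of the slide, not the actual design:

    def dispatch_step(inst, strand_cache, dirty_table, suppressed_pcs):
        """Return the instruction(s) to hand to rename in place of inst.
        strand_cache maps a strand's leading PC to a Strand object with
        component_pcs and transient_regs; dirty_table maps reg -> bool."""
        if inst.pc in suppressed_pcs:
            suppressed_pcs.discard(inst.pc)          # component already covered by a strand
            return []
        out = []
        for reg in inst.srcs:
            if dirty_table.get(reg, False):          # unexpected reader of a collapsed value
                out.append(("recovery_strand", reg)) # recover that register first
                dirty_table[reg] = False
        strand = strand_cache.get(inst.pc)
        if strand is not None:
            suppressed_pcs.update(strand.component_pcs[1:])
            for reg in strand.transient_regs:        # intermediates will never be written back
                dirty_table[reg] = True
            out.append(strand)
        else:
            if inst.dest is not None:
                dirty_table[inst.dest] = False       # a real write makes the register clean
            out.append(inst)
        return out
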

16
Closed-Loop ALUs
[Diagram: an ALU in closed-loop mode uses a free local bypass (½ cycle); a mode switch selects between it and the full bypass network (½ cycle)]
  • Full bypass is half of the execute stage delay
  • Regular ALUs with double-speed closed-loop mode
  • two dependent ALU operations in a single cycle
  • intermediate values (the transients) are
    discarded!
  • final result still takes ½ cycle for full bypass
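
A small functional sketch of strand evaluation in closed-loop mode: each intermediate feeds the next operation directly and is discarded, and only the final value would go out over the full bypass (the step format and ALU_FNS table are assumptions; timing is not modeled):

    import operator

    ALU_FNS = {"add": operator.add, "sub": operator.sub, "and": operator.and_, "xor": operator.xor}

    def execute_strand(first_src, steps, regfile):
        """Evaluate a strand back-to-back on one ALU.  first_src feeds the first
        step; steps is a list of (opcode, other_source_reg).  The running value
        stays in the local loop and only the final result leaves the ALU."""
        value = regfile[first_src]
        for opcode, src in steps:
            value = ALU_FNS[opcode](value, regfile[src])   # intermediate is discarded
        return value                                       # only this crosses the full bypass

    # The dyadic-dilemma chain ((a + b) + c) + d as a single strand:
    regs = {"R1": 1, "R2": 2, "R3": 3, "R4": 4}
    print(execute_strand("R1", [("add", "R2"), ("add", "R3"), ("add", "R4")], regs))   # 10
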

17
Oops! A Dirty Read
[Diagram: a load encounters a dirty R1; a recovery sub-strand is inserted to recover R1]
18
Oops! An Anti-Dependence Violation
[Diagram: R9 has already been replaced, so a load immediate of R9's previous value is inserted as a safety net; renaming is not sufficient outside the reorder buffer]
19
Outline
  • Motivation
  • Transient Operands and Strands
  • Instruction Replacement Hardware
  • Results
  • Conclusion

20
Instruction Coverage
Average ALU instruction coverage: 12% with a 16-entry strand cache, 27% with 1024 entries
High coverage rates are reached, but only with a big strand cache
Less than a 15% replacement rate, regardless of cache size
[Chart: coverage with various strand cache sizes]
21
IPC Improvements
Average IPC speedup: 17% on a 4-wide machine, 20% on an 8-wide machine
Some benchmarks almost double in IPC
Some see almost no speedup at all
[Chart: 4-wide IPC speedup with a 16-entry strand cache]
22
Resource Occupancy
  • CISCification of instructions reduces traffic
  • reorder buffer occupancy is reduced by up to 37%
  • issue queue occupancy is reduced by up to 34%
  • traffic reduction ∝ coverage
  • Reduced dependence on dependencies
  • opportunity for pipelined bypass
  • opportunity for pipelined issue

23
Resource Occupancy
  • Caveat emptor
  • more worst-case issue CAMs
  • more worst-case register ports
  • Prior work is applicable
  • only 1.2 live inputs per strand

24
Outline
  • Motivation
  • Transient Operands and Strands
  • Instruction Replacement Hardware
  • Results
  • Conclusion

25
Conclusion
  • Key points
  • eagerly executing macro-instructions → value precomputation
  • limiting focus to transient operands
  • all new hardware off critical path
  • Results
  • IPC speedup of 18-20% with a 3KB strand cache
  • potential for frequency gains
  • full binary compatibility
  • Lots of current and future research
  • relaxing the ALU-instructions-only constraint
  • quantified frequency improvements
  • static detection of strands

Questions?
26
Backup Slides
27
Sensitivity to Dispatch Delay
On average, speedup drops only 1% with three cycles of dispatch delay
Some benchmarks actually get faster due to fewer errant strands
Most benchmarks lose a small amount of speedup
[Chart: 4-wide IPC speedup with a 16-entry strand cache]