Title: Dynamically Collapsing Dependencies for IPC and Frequency Gain
1. Dynamically Collapsing Dependencies for IPC and Frequency Gain
- Peter G. Sassone
- D. Scott Wills
- Georgia Tech
- Electrical and Computer Engineering
- sassone, scott.wills _at_ ece.gatech.edu
2. Motivation
[Figure: processor pipeline (fetch, decode, rename, issue, exec, commit) alongside the memory hierarchy (I-cache, D-cache, L2 cache, memory)]
- Outside of the pipeline, global communication dominates
  - the memory wall is well studied
- Inside the pipeline, delay was traditionally computation- or logic-dominated
3. Motivation
[Figure: out-of-order core: issue logic, issue queue, register file, and ALUs]
- Now dominated by local communication paths:
  - issue window
  - reorder buffer
  - register file
  - bypass network
- Bottlenecks both IPC and frequency
4. Motivation
[Figure: out-of-order core: issue logic, issue queue, register file, and ALUs]
- RISC instruction sets create superfluous traffic
  - all instructions and operands are treated as equal
  - little focus on exposing sequentiality
5. Contributions
- Dynamic Strands
  - collapse dependence chains without fan-out
  - exploit their properties for simple value precomputation
  - increase efficiency of critical resources
  - preserve binary compatibility
- IPC improvements
  - 17-20% speedup on SPEC2000int and MediaBench
- Frequency improvements
  - 37% fewer in-flight instructions
  - reduced dependence on dependencies
6. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
7. Dyadic Dilemma
- Performing any operation on more than two sources requires temporary values

int sum( int a, int b, int c, int d ) {
    return a + b + c + d;
}

. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .
8. Transient Operands
- We term these temporary values transient operands
  - values produced by an ALU inst
  - values consumed only once, and only by an ALU inst
- Common in modern integer workloads
On average, about 40% of all dynamic operands are transient.
9. Strands
- Strands
  - linear chains of instructions joined by transient operands
  - non-consecutive
  - span basic blocks
  - three instructions
  - only the final output needs to be committed
- Strands are common (a source-level example follows below)
  - dyadic temporaries
  - compiler strategies
  - language semantics
[Figure: example strand, a linear chain of instructions (a, b, c, d) joined by transient operands]
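Not from the original slides, but as a hedged source-level illustration of where strands come from: a small expression whose intermediate values are each read exactly once compiles to exactly such a chain of dependent ALU instructions.

```c
/* Hypothetical example (not the authors'): evaluating a small polynomial
 * compiles to a chain of dependent ALU instructions (mul, add, mul, add).
 * Each intermediate result is produced by an ALU instruction and read
 * exactly once by the next ALU instruction, so the intermediates are
 * transient operands and the chain forms a strand.  Only the final value
 * must be committed. */
int poly(int x, int a, int b, int c)
{
    return (a * x + b) * x + c;
}
```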
10. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
11. Hardware Overview
- All new strand hardware sits off the critical path
[Figure: pipeline (fetch, decode, rename, issue queue, register file, ALUs, commit) with the strand hardware attached off the critical path]
12. Algorithm Example
[Figure: numbered step-by-step walkthrough of the algorithm over the pipeline (fetch, decode, rename, issue queue, register file, ALUs, commit)]
13. Strand Cache Fill Unit
[Figure: operand table indexed by architectural register (R4, R5, R6, ...); each entry holds the last producer instruction, the last consumer instruction, and a consumer count, shown with example entries at PCs 1404-1416]
- Based around the operand table (a rough C sketch follows below)
- Detects the conditions for transient operands
- When one is found:
  - append to an existing strand, or
  - begin a new strand
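A minimal C sketch of the bookkeeping the fill unit could perform, assuming a simple table indexed by architectural register. The field names, the record_strand hook, and detecting transience at the next overwrite are assumptions for illustration, not the authors' exact hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ARCH_REGS 32

/* One operand-table entry per architectural register (fields are guesses
 * based on the columns shown on the slide). */
typedef struct {
    uint64_t last_producer_pc;  /* last instruction to write this register     */
    uint64_t last_consumer_pc;  /* last instruction to read this register      */
    int      consumer_count;    /* reads seen since the last write             */
    bool     producer_is_alu;   /* transients must be produced by an ALU inst  */
    bool     consumer_is_alu;   /* ...and consumed only by an ALU inst         */
} operand_entry;

static operand_entry table[NUM_ARCH_REGS];

/* Hypothetical hook: link producer and consumer into a strand, either
 * appending to an existing strand or beginning a new one. */
static void record_strand(uint64_t producer_pc, uint64_t consumer_pc)
{
    (void)producer_pc;
    (void)consumer_pc;
}

/* Called for every instruction the fill unit observes (off the critical path). */
void observe(uint64_t pc, bool is_alu, int dst, const int *srcs, int nsrcs)
{
    for (int i = 0; i < nsrcs; i++) {          /* note all reads of the sources */
        operand_entry *e = &table[srcs[i]];
        e->last_consumer_pc = pc;
        e->consumer_is_alu  = is_alu;
        e->consumer_count++;
    }
    if (dst >= 0) {                            /* this write kills the old value */
        operand_entry *e = &table[dst];
        /* The overwritten value was a transient operand if an ALU instruction
         * produced it and exactly one ALU instruction consumed it. */
        if (e->producer_is_alu && e->consumer_count == 1 && e->consumer_is_alu)
            record_strand(e->last_producer_pc, e->last_consumer_pc);
        e->last_producer_pc = pc;
        e->producer_is_alu  = is_alu;
        e->consumer_count   = 0;
    }
}
```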
14. Strand Cache
About 175 bytes per line, though very few lines are needed for effect (a rough C model of one line follows below).
[Figure: strand cache; each line holds status bits, previous-reader info, and up to three component instructions (i1, i2, i3), along with a PC, a ready flag, and the precomputed value]
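A rough C model of one strand-cache line, assuming the fields named in the figure above. The widths, the encodings, and the placement of the previous value used by the anti-dependence safety net are our reading of the slides, not a verified layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define STRAND_MAX_INSTS 3   /* the slide shows three instruction slots per line */

/* One strand-cache line (~175 bytes in the real design; this model only
 * shows which pieces of state live together, not the actual encoding). */
typedef struct {
    uint8_t  status;                    /* status bits: valid, length, etc.       */
    uint64_t prev_reader_pc;            /* previous-reader info, assumed here to  */
    int64_t  prev_value;                /*   back the anti-dependence safety net  */
    uint64_t insts[STRAND_MAX_INSTS];   /* the collapsed component instructions   */
    uint64_t pc;                        /* PC the dispatch engine matches on      */
    bool     ready;                     /* live inputs available?                 */
    int64_t  value;                     /* precomputed result, if ready           */
} strand_cache_line;
```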
15. Dispatch Engine
[Figure: the dispatch engine sits between decode and rename; it takes pre-renamed instructions, consults the strand cache and the dirty table, and emits strands, recovery strands, and kill signals]
- Watches for strand cache matches
- Inserts ready strands into the stream eagerly
- Removes component instructions when seen
- Correctness checking with the dirty table (decision logic sketched below)
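A minimal sketch of the per-instruction decision the dispatch engine makes, under the assumptions above. All of the helper functions (strand-cache lookup, dirty-table check, strand and recovery-strand insertion, suppression of covered instructions) are hypothetical stand-ins for the hardware structures in the figure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queries/actions standing in for the hardware structures. */
static bool strand_starts_at(uint64_t pc)         { (void)pc; return false; } /* strand-cache tag match    */
static bool strand_inputs_dirty(uint64_t pc)      { (void)pc; return false; } /* dirty-table check         */
static bool covered_by_active_strand(uint64_t pc) { (void)pc; return false; } /* component already handled */
static void insert_strand(uint64_t pc)            { (void)pc; }               /* eager strand insertion    */
static void insert_recovery_strand(uint64_t pc)   { (void)pc; }               /* recompute a dirty value   */

typedef enum { PASS, REPLACED, SUPPRESSED } dispatch_action;

/* One decision per pre-renamed instruction arriving from decode. */
dispatch_action dispatch(uint64_t pc)
{
    if (strand_starts_at(pc)) {
        if (strand_inputs_dirty(pc))
            insert_recovery_strand(pc);   /* dirty read: recover the value (see the Dirty Read slide) */
        else
            insert_strand(pc);            /* eagerly inject the ready, collapsed strand */
        return REPLACED;
    }
    if (covered_by_active_strand(pc))
        return SUPPRESSED;                /* component instruction removed from the stream */
    return PASS;                          /* ordinary instruction: forward to rename unchanged */
}
```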
16. Closed-Loop ALUs
[Figure: ALU with a free local bypass loop (½ cycle) and a mode switch onto the full bypass network (½ cycle)]
- Full bypass is half of the execute-stage delay
- Regular ALUs with a double-speed closed-loop mode (behavioral sketch below):
  - two dependent ALU operations in a single cycle
  - intermediate values (the transients) are discarded!
  - the final result still takes ½ cycle for full bypass
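A behavioral sketch, not circuitry, of one closed-loop cycle, assuming a simple two-operation strand; the operation set and operand ordering are illustrative.

```c
#include <stdint.h>

typedef enum { ADD, SUB, AND, OR, XOR } alu_op;

static int64_t alu(alu_op op, int64_t a, int64_t b)
{
    switch (op) {
        case ADD: return a + b;
        case SUB: return a - b;
        case AND: return a & b;
        case OR:  return a | b;
        case XOR: return a ^ b;
    }
    return 0;
}

/* Two dependent ALU operations in a single cycle: the intermediate (the
 * transient) stays on the free local bypass and is discarded, never touching
 * the register file or the full bypass network; only the final result pays
 * the remaining half-cycle of full bypass. */
int64_t closed_loop_cycle(alu_op op1, int64_t a, int64_t b,
                          alu_op op2, int64_t c)
{
    int64_t transient = alu(op1, a, b);   /* first half-cycle, local bypass only       */
    return alu(op2, transient, c);        /* second half-cycle, result to full bypass  */
}
```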
17. Oops: Dirty Read
[Figure: a load writes R1, so R1 is dirty; a recovery sub-strand is inserted to recover R1]
18. Oops: Anti-Dependence Violation
[Figure: R9 has already been replaced, but renaming is not sufficient outside the reorder buffer; as a safety net, a load immediate of the previous value of R9 is inserted]
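A heavily hedged sketch of this safety net: we assume the previous value of the strand's destination register is captured alongside the previous-reader info, and that the dispatch engine can inject a load-immediate micro-op. Only the "insert a load immediate of the previous value" step comes from the slide; the names and structure here are guesses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state captured when a strand eagerly replaces its
 * destination register (R9 in the slide's example). */
typedef struct {
    int     dest_reg;     /* architectural register the strand wrote          */
    int64_t prev_value;   /* value that register held before the replacement  */
} eager_writeback;

/* Hypothetical micro-op emitter inside the dispatch engine. */
static void emit_load_immediate(int reg, int64_t value)
{
    (void)reg;
    (void)value;
}

/* Renaming alone cannot repair this case outside the reorder buffer, so on a
 * detected anti-dependence violation the previous value is re-materialized
 * with a load immediate. */
void anti_dependence_safety_net(const eager_writeback *w, bool violation_detected)
{
    if (violation_detected)
        emit_load_immediate(w->dest_reg, w->prev_value);
}
```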
19. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
20. Instruction Coverage
Average ALU instruction coverage: roughly 12% with a 16-entry strand cache, rising to 27% with 1024 entries.
High coverage rates, but only with a big strand cache.
Less than a 15% replacement rate, regardless of cache size.
[Chart: coverage with various strand cache sizes]
21. IPC Improvements
Average IPC speedup: 17% (4-wide), 20% (8-wide).
Some benchmarks almost double in IPC.
Some see almost no speedup at all.
[Chart: 4-wide IPC speedup with a 16-entry strand cache]
22. Resource Occupancy
- CISCification of instructions reduces traffic
  - reorder buffer occupancy is reduced by up to 37%
  - issue queue occupancy is reduced by up to 34%
  - traffic reduction scales with coverage
- Reduced dependence on dependencies
  - opportunity for pipelined bypass
  - opportunity for pipelined issue
23. Resource Occupancy
- Caveat emptor
  - more worst-case issue CAMs
  - more worst-case register ports
- Prior work is applicable
  - only 1.2 live inputs per strand
24. Outline
- Motivation
- Transient Operands and Strands
- Instruction Replacement Hardware
- Results
- Conclusion
25. Conclusion
- Key points
  - eagerly executing macro-instructions → value precomputation
  - limiting focus to transient operands
  - all new hardware is off the critical path
- Results
  - IPC speedup of 18-20% with a 3KB strand cache
  - potential for frequency gains
  - full binary compatibility
- Lots of current and future research
  - relaxing the ALU-instruction constraint
  - quantifying frequency improvements
  - static detection of strands
Questions?
26. Backup Slides
27. Sensitivity to Dispatch Delay
On average, speedup drops by only 1% with three cycles of dispatch delay.
Some benchmarks actually get faster due to fewer errant strands.
Most benchmarks lose a small amount of speedup.
[Chart: 4-wide IPC speedup with a 16-entry strand cache]