Title: Physical Register Inlining (PRI)
1Physical Register Inlining (PRI)
- Mikko H. Lipasti1, Brian Mestan2, and Erika Gunadi1
- 1Department of Electrical and Computer Engineering, University of Wisconsin-Madison
- 2IBM Microelectronics, IBM Corporation, Austin, TX
- http://www.ece.wisc.edu/pharm
2Demand for Large Register Files
Instruction Window
- Deeper Pipeline
- Increasing pressure on Register File
- Lots of attention / prior work
3Challenges with Scaling Register Files
- Additional pipe stages needed for access
- Increases branch misprediction penalty
- Increases scheduling misprediction penalty
- Requires additional bypass logic
- Further increases pipeline depth
- Increases the demand for more registers
4Physical Register Lifetime
[Figure: physical register lifetime, shown for width 4 and width 8]
5Prior Work
- Register file caching: Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002
- Late allocation: Gonzalez et al. 1998, Monreal et al. 1999
- Efficient management
- Early deallocation: Moudgill et al. 1993
- Program semantics: Martin et al. 1997, Lo et al. 1999
- Checkpointing: Martinez et al. 2002, Akkary et al. 2003
- Value-based optimizations: Jourdan et al. 1998
6Early Deallocation
- Moudgill et al. 1993
- Focused on last read to release
- Avoid waiting for the next writer to commit
- Deallocate registers as soon as they are:
- Complete (complete flag)
- Unmapped (unmap flag)
- No outstanding readers (reference counter)
- Still requires next writer to enter the window
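The three release conditions listed above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the hardware described by Moudgill et al.; the class and field names (`PhysRegState`, `complete`, `unmapped`, `readers`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Per-register bookkeeping (hypothetical field names)."""
    complete: bool = False  # complete flag: the producer has written its result
    unmapped: bool = False  # unmap flag: a later writer has renamed the logical register
    readers: int = 0        # reference counter: renamed consumers that have not read yet


def can_deallocate(reg: PhysRegState) -> bool:
    """A register may be released once all three conditions hold."""
    return reg.complete and reg.unmapped and reg.readers == 0


reg = PhysRegState()
reg.readers = 2        # two consumers have been renamed
reg.complete = True    # the producer has executed
reg.unmapped = True    # the next writer has entered the window
early = can_deallocate(reg)  # readers are still outstanding
reg.readers = 0        # both consumers have read the value
late = can_deallocate(reg)
```

Note that `unmapped` can only become true once the next writer has been renamed, which is the limitation the last bullet points out.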
7Physical Register Inlining
- Exploits narrow operands: a sizable fraction of operands can be stored in less than 8 bits (Canal et al. 2000)
- Often fewer bits than needed to specify a physical register
- Store the value instead of the pointer
- Stores narrow values in map table
- Reduces physical register lifetime
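To make "store the value instead of the pointer" concrete, here is a minimal sketch of a map entry whose field holds either a physical register number or the narrow value itself. The names (`MapEntry`, `is_narrow`, `read_operand`) are hypothetical, and treating narrow values as signed two's-complement is an assumption; the 7-bit width is taken from the machine model slide.

```python
from dataclasses import dataclass

NARROW_BITS = 7  # inline field width for integers, per the machine model slide


def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if value fits in a signed two's-complement field of `bits` bits (assumed encoding)."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


@dataclass
class MapEntry:
    inlined: bool  # True: `field` is the value itself; False: `field` is a physical register number
    field: int


def read_operand(entry: MapEntry, prf: list) -> int:
    """Return the operand value, reading the register file only when the entry is a pointer."""
    return entry.field if entry.inlined else prf[entry.field]


prf = [0] * 64          # 64 physical registers, as in the machine model
prf[12] = 100000        # a wide value must stay in the register file
wide = MapEntry(inlined=False, field=12)
small = MapEntry(inlined=True, field=-5)  # a narrow value needs no physical register
```

With 64 physical registers a pointer needs 6 bits, so a 7-bit inline field costs about the same storage as the pointer it replaces.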
8Operand Significance
- The paper also has an FP graph; FP inlining exploits 0.0/1.0 (54)
9Outline
- Motivation
- Prior Work
- Physical Register Inlining
- Quick Microarchitectural Review
- Modifications Needed
- PRI early deallocation
- Experiments
- Conclusions
10Microarchitectural Review
- Register Rename/Map Tables
- Maps logical names to physical names
- Removes false name dependences
- Two common types: RAM and CAM
- CAM map is positional
- Not suitable for storing values
[Figure: RAM map and CAM map, with entries labeled by logical register (Logical reg) and physical register (Phys reg)]
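The difference between the two map organizations can be sketched as follows. This is an illustrative model with made-up contents, not a description of any particular processor; `ram_map`, `cam_map`, and the lookup helpers are hypothetical names.

```python
# RAM map: indexed by logical register; each entry stores a physical register number.
ram_map = [3, 7, 1]  # r0 -> p3, r1 -> p7, r2 -> p1


def ram_lookup(logical: int) -> int:
    """Direct indexed read: the stored field is the pointer."""
    return ram_map[logical]


# CAM map: one entry per physical register; the entry's position IS the physical
# register number, and the entry stores (logical register, valid bit).
cam_map = [(None, False)] * 8
cam_map[3] = (0, True)
cam_map[7] = (1, True)
cam_map[1] = (2, True)


def cam_lookup(logical: int):
    """Associative search: the matching entry's position gives the physical register."""
    for preg, (lreg, valid) in enumerate(cam_map):
        if valid and lreg == logical:
            return preg
    return None
```

In the RAM map the pointer is a stored field that could be overwritten with a narrow value. In the CAM map the pointer is implied by position, so there is no field to hold a value, which is why the slide calls it unsuitable.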
11Microarchitectural Review
- Allocating and Freeing Physical Registers
- Allocates physical register at decode; map table entry is updated
- Releases physical register when next writer is committed
- Checkpoint and Recovery of Register Map
- Optimization to reduce branch misprediction penalty
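The baseline policy above (allocate at decode, release when the next writer commits) can be sketched as a small renamer. This is a simplified model with hypothetical names (`Renamer`, `rename_dest`, `commit`); checkpointing and misprediction recovery are left out.

```python
from collections import deque


class Renamer:
    """Baseline register renaming with a FIFO free list (simplified sketch)."""

    def __init__(self, n_logical: int, n_physical: int):
        self.map = list(range(n_logical))                 # logical -> physical
        self.free = deque(range(n_logical, n_physical))   # unallocated physical registers

    def rename_dest(self, lreg: int):
        """At decode: allocate a register and update the map. Returns (new, previous)."""
        new = self.free.popleft()   # raises IndexError if empty; hardware would stall instead
        prev = self.map[lreg]
        self.map[lreg] = new
        return new, prev

    def commit(self, prev: int) -> None:
        """When the writer commits, the previous mapping of its destination is released."""
        self.free.append(prev)


r = Renamer(n_logical=4, n_physical=6)
new, prev = r.rename_dest(3)   # r3: p3 -> p4
r.commit(prev)                 # p3 returns to the free list only now
```

The register `prev` stays allocated from its own rename until the next writer of the same logical register commits, which is the lifetime PRI tries to shorten.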
12Modifications to Data Flow
[Figure: pipeline diagram with stages Dcd, Rnm, Sched, Disp, RF, Exe, Retire, Commit; blocks labeled Fetch, Queue, Map, Payload RAM, ALU; and a "Narrow?" check]
- Execution stage must allow both operands to be read from payload RAM
- Already supports one immediate operand
- Sign extension between payload RAM and the ALU input
- Narrow checking logic to verify if the operands are narrow
- Narrow datapath back to the map table
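The sign-extension step between the payload RAM and the ALU input can be sketched as below. The 7-bit field width comes from the machine model slide; the 64-bit datapath width and the function name `sign_extend` are assumptions made for illustration.

```python
def sign_extend(field: int, bits: int = 7, width: int = 64) -> int:
    """Extend a `bits`-wide two's-complement field to `width` bits.

    The result is returned as an unsigned `width`-bit pattern, as it would
    appear on the ALU input.
    """
    field &= (1 << bits) - 1          # keep only the stored bits
    if field & (1 << (bits - 1)):     # sign bit set: value is negative
        field -= 1 << bits
    return field & ((1 << width) - 1)


positive = sign_extend(0b0000101)   # 5
negative = sign_extend(0b1111011)   # -5 as a 64-bit pattern
```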
13Modifications to Map Table
- Registers freed from the retire/wb stage and commit stage
- Tolerant of duplicate deallocations of the same physical register
- Once as narrow, again at next write commit
- Map entries need to be writable from rename stage and retire/wb stage
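One way to tolerate duplicate deallocations is a bit-vector free list, where releasing an already-free register is a harmless no-op. This is an assumed organization for illustration only; the slides do not say how the free list is built, and `FreeList` is a hypothetical name.

```python
class FreeList:
    """Bit-vector free list: release is idempotent (assumed design, not from the slides)."""

    def __init__(self, n_physical: int):
        self.free_bits = [False] * n_physical

    def release(self, preg: int) -> None:
        """Mark a register free; a second release of a free register changes nothing."""
        self.free_bits[preg] = True

    def allocate(self):
        """Return the lowest-numbered free register, or None if there is none."""
        for preg, is_free in enumerate(self.free_bits):
            if is_free:
                self.free_bits[preg] = False
                return preg
        return None


fl = FreeList(4)
fl.release(2)        # freed early, when its value was inlined as narrow
fl.release(2)        # freed again, when the next writer commits
first = fl.allocate()
second = fl.allocate()
```

With a FIFO queue the second release would enqueue the register twice and hand it to two instructions. The sketch does not cover a register that is reallocated between the two releases; that case is part of the stale pointer problem.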
14Stale Pointer Problem
[Figure: pointers into the PRF held by the MAP, its checkpoint copies, the ROB, and the IssueQ]
- Deallocating physical registers early makes these pointers stale
- Equivalent to the garbage collection issue
- Two choices
- Delay deallocation until pointers not valid (refcount)
- Update all pointers (ideal IPC)
15Map table checkpoints problem
- Map table checkpoints need to be updated when a narrow operand is written
- Lazy update
- Complex, but not cycle time critical
- Checkpoint reference counting
- Similar to Akkary et al.
- Delays deallocation, reduces IPC benefit slightly
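Checkpoint reference counting can be sketched as follows: each physical register counts the checkpoints that still point to it, and an early release is deferred until that count reaches zero. This is a simplified model with hypothetical names (`CheckpointRefCounts`, `request_early_free`), not the exact mechanism from the paper or from Akkary et al.; checkpoints here contain only pointers.

```python
class CheckpointRefCounts:
    """Defer early deallocation while any checkpoint references the register (sketch)."""

    def __init__(self, n_physical: int):
        self.refs = [0] * n_physical   # checkpoints pointing at each register
        self.pending = set()           # inlined registers waiting for refs to drain
        self.free = []                 # registers actually released

    def take_checkpoint(self, map_table):
        snapshot = list(map_table)
        for preg in snapshot:
            self.refs[preg] += 1
        return snapshot

    def release_checkpoint(self, snapshot) -> None:
        for preg in snapshot:
            self.refs[preg] -= 1
            if self.refs[preg] == 0 and preg in self.pending:
                self.pending.discard(preg)
                self.free.append(preg)

    def request_early_free(self, preg: int) -> None:
        """Called when a narrow value has been inlined and the register is no longer needed."""
        if self.refs[preg] == 0:
            self.free.append(preg)
        else:
            self.pending.add(preg)     # a checkpoint would go stale: delay


rc = CheckpointRefCounts(8)
cp = rc.take_checkpoint([1, 2, 3])
rc.request_early_free(2)
deferred = list(rc.free)   # p2 is still referenced by the checkpoint
rc.release_checkpoint(cp)
```

The delay between `request_early_free` and the actual release is the source of the slight IPC loss mentioned in the last bullet.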
16Example of WAR Violation
Load p1 <- MEM[p7]
And  p2 <- p3, p4    (narrow)
Add  p5 <- p1, p2
Or   p2 <- p8, p9    (WAR violation)
- Rare, but frequent enough to affect performance
- Must have efficient solution
17Rename Table WAW Hazards
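[Figure: two writers of r3 (r3 r1 r2, renamed to p4 p1 p2 and p5 p1 p2) moving through Fetch, Decode, Execute, Retire, Commit; the MAP entry for r3 goes from p3 to p4 to p5, the ROB (Dst) holds p3, p4, p5, and the late narrow write is marked "WAW!"]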
- WAW hazards
- Writes narrow value to a remapped map entry
- Must ensure that the map entry has not been remapped
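The check described above can be sketched as a guarded write: the narrow value is written only if the map entry still points at the producer's own physical register. The names (`MapEntry`, `try_inline`) are hypothetical, and the value is assumed to have already passed the narrow check.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    inlined: bool  # True: `field` is a narrow value; False: `field` is a physical register number
    field: int


def try_inline(entry: MapEntry, producer_preg: int, value: int) -> bool:
    """Write a narrow value into the map entry unless a younger writer has remapped it."""
    if entry.inlined or entry.field != producer_preg:
        return False          # entry was remapped: drop the write to avoid the WAW hazard
    entry.inlined = True
    entry.field = value
    return True


r3 = MapEntry(inlined=False, field=4)   # first writer renamed r3 -> p4
r3.field = 5                            # second writer renames r3 -> p5 before the first executes
stale = try_inline(r3, producer_preg=4, value=1)   # late write from the p4 producer
fresh = try_inline(r3, producer_preg=5, value=2)   # write from the current mapping
```

Without the comparison, the first writer's value would overwrite the mapping that belongs to the second writer.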
18Integrating PRI with Early Deallocation
- Not all operands are narrow
- Reduces register lifetime further
- Adds unmap flags and complete flags (Moudgill et al. 1993)
[Figure: physical register lifetime at width 4 for baseline, PRI, and PRI+ER]
19Machine Model
- 4-wide fetch, issue, commit
- 512 ROB, 256 LSQ
- 32-entry scheduler
- 64 physical registers
- Speculative scheduling with selective recovery
- Combined bimodal branch predictor
- 32KB IL1, 32KB DL1, 512KB L2
- 7 bits PRI for integer, 1 bit PRI for FP
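As an illustration of the 1-bit FP case, the sketch below assumes the single bit selects between the two constants 0.0 and 1.0 mentioned on the Operand Significance slide. That encoding, the exclusion of -0.0, and the names `fp_inline_field` and `fp_expand` are assumptions, not details stated in the slides.

```python
import struct


def bits_of(x: float) -> int:
    """Raw IEEE 754 double-precision bit pattern of x."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


ZERO_BITS = bits_of(0.0)
ONE_BITS = bits_of(1.0)


def fp_inline_field(x: float):
    """Return the 1-bit inline encoding of x, or None if x cannot be inlined."""
    b = bits_of(x)
    if b == ZERO_BITS:
        return 0
    if b == ONE_BITS:
        return 1
    return None   # includes -0.0, whose bit pattern differs from +0.0


def fp_expand(field: int) -> float:
    """Recover the full value from the 1-bit field."""
    return 1.0 if field else 0.0
```

Comparing bit patterns rather than using `==` keeps -0.0 out of the inlined set, so expanding the field always reproduces the original value exactly.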
20Speed Up for Integer Benchmarks
- PRI (checkpoint reference counting) performs substantially better than previous work
- Checkpoint reference counting performs close to the ideal case (ideal lazy)
- Combining PRI and ER increases performance further
21PRF Occupancy for Int. Benchmarks
- PRI reduces register file pressure more than the previous work (ER)
- Combining PRI and ER reduces the pressure further
22Speed Up for FP Benchmark
- Ammp benchmark -> physical registers are not the performance bottleneck
- Art benchmark -> a lot of narrow operands to exploit
- Wupwise benchmark -> few narrow operands
23Conclusion
- PRI can lead to substantial performance improvement for both integer and FP benchmarks
- Ideal update of stale pointers provides only marginal benefit
- Checkpoint reference counting is the best choice
24Future Work
- Interaction of PRI with delayed register allocation (virtual physical registers) (Gonzalez et al. 1998)
- Interaction of PRI with software-based techniques to deallocate dead registers
- PRI enables a binary-compatible mechanism for the compiler to tell the hardware that a register is dead
- Compiler can simply insert a load immediate of a narrow value to any register that seems dead
25Questions?