1
Physical Register Inlining (PRI)

Mikko H. Lipasti (1), Brian Mestan (2), and Erika Gunadi (1)
(1) Department of Electrical and Computer Engineering, University of Wisconsin-Madison
(2) IBM Microelectronics, IBM Corporation, Austin, TX

http://www.ece.wisc.edu/pharm
2
Demand for Large Register Files
Instruction Window

Deeper Pipeline
Increasing pressure on Register File
Lots of attention / prior work

3
Challenges with Scaling Register Files

Additional pipe stages needed for access
Increases branch misprediction penalty
Increases scheduling misprediction penalty
Requires additional bypass logic
Further increases pipeline depth
Increases the demand for more registers

4
Physical Register Lifetime
(Figure: physical register lifetime for width = 4 and width = 8)

Managed inefficiently

5
Prior Work

Register file caching: Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002
Late allocation: Gonzalez et al. 1998, Monreal et al. 1999
Efficient management
Early deallocation: Moudgill et al. 1993
Program semantics: Martin et al. 1997, Lo et al. 1999
Checkpointing: Martinez et al. 2002, Akkary et al. 2003
Value-based optimizations: Jourdan et al. 1998

6
Early Deallocation

Moudgill et al. 1993
Focuses on releasing a register at its last read
Avoids waiting for the next writer to commit
Deallocates a register as soon as it is:
Complete (complete flag)
Unmapped (unmap flag)
Free of outstanding readers (reference counter)
Still requires the next writer to enter the window
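
The three release conditions on this slide can be sketched as a small piece of per-register state. This is only an illustration, not the mechanism from Moudgill et al.; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Per-register bookkeeping for early release (hypothetical field names)."""
    complete: bool = False   # the producer has written its result
    unmapped: bool = False   # a later writer has renamed the same logical register
    readers: int = 0         # renamed consumers that have not yet read the value

    def can_release(self) -> bool:
        # All three conditions from the slide must hold at the same time.
        return self.complete and self.unmapped and self.readers == 0


reg = PhysRegState()
reg.readers += 1              # a consumer is renamed and will read this register
reg.complete = True           # the producer finishes executing
reg.unmapped = True           # the next writer enters the window and remaps
blocked = reg.can_release()   # one reader is still outstanding
reg.readers -= 1              # the consumer reads its operand
released = reg.can_release()  # all three conditions now hold
```

Note that `unmapped` can only become true once the next writer has been renamed, which is the limitation the last bullet points out.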

7
Physical Register Inlining

Exploits narrow operands: a sizable fraction of operands can be stored in fewer than 8 bits (Canal et al. 2000)
That is often fewer bits than are needed to specify a physical register
Store the value instead of the pointer
Stores narrow values in the map table
Reduces physical register lifetime
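
As a rough illustration of "store the value instead of the pointer", the sketch below models a map table entry that holds either a physical register number or an inlined narrow value. The names (`MapEntry`, `writeback`, `NARROW_BITS`) are hypothetical, and the signed 7-bit range is an assumption taken from the integer setting on the machine model slide.

```python
from dataclasses import dataclass

NARROW_BITS = 7  # assumed inline width for integer values


def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if value fits in a signed two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


@dataclass
class MapEntry:
    """A map entry holds a physical register pointer or an inlined value."""
    inlined: bool
    payload: int  # physical register number, or the narrow value itself


def writeback(entry: MapEntry, result: int) -> bool:
    """Inline a narrow result; return True if the physical register can be freed."""
    if not entry.inlined and is_narrow(result):
        entry.inlined = True
        entry.payload = result
        return True
    return False


small = MapEntry(inlined=False, payload=12)   # points at physical register 12
freed_small = writeback(small, -3)            # narrow: the value replaces the pointer
wide = MapEntry(inlined=False, payload=13)
freed_wide = writeback(wide, 1000)            # too wide: the entry keeps its pointer
```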

8
Operand Significance

The paper also has an FP graph; for FP, PRI exploits 0.0/1.0 (54)

9
Outline

Motivation
Prior Work
Physical Register Inlining
Quick Microarchitectural Review
Modifications Needed
PRI early deallocation
Experiments
Conclusions

10
Microarchitectural Review

Register Rename/Map Tables
Maps logical names to physical names
Removes false name dependences
Two common types: RAM and CAM
CAM map is positional
Not suitable for storing values

(Figure: RAM map and CAM map side by side. Labels: Logical reg, Phys reg)
11
Microarchitectural Review

Allocating and Freeing Physical Registers
Allocates a physical register at decode; the map table entry is updated
Releases the physical register when the next writer commits
Checkpoint and Recovery of Register Map
Optimization to reduce the branch misprediction penalty
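
The baseline allocate-at-decode, release-at-next-writer-commit policy can be sketched as follows. This is a simplified model with hypothetical names (`Renamer`, `rename_dest`, `commit`); it uses a RAM-style map indexed by logical register and leaves out checkpoints and recovery.

```python
from collections import deque


class Renamer:
    """Baseline RAM-style rename: logical register index -> physical register."""

    def __init__(self, num_logical: int, num_physical: int):
        # Initially logical register i maps to physical register i.
        self.map = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))

    def rename_dest(self, logical: int):
        """Allocate at decode; return (new physical, previous physical)."""
        # A real pipeline would stall on an empty free list;
        # this sketch raises IndexError instead.
        new = self.free.popleft()
        prev = self.map[logical]
        self.map[logical] = new
        return new, prev

    def commit(self, prev: int) -> None:
        """When the next writer commits, the previous mapping is released."""
        self.free.append(prev)


r = Renamer(num_logical=4, num_physical=6)
new_preg, prev_preg = r.rename_dest(3)   # the writer of logical r3 is decoded
r.commit(prev_preg)                      # later, that writer commits
```

The previous physical register stays allocated for the whole interval between `rename_dest` and `commit`, which is the lifetime that early deallocation and PRI try to shorten.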

12
Modifications to Data Flow
(Figure: pipeline diagram. Labels: Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit, Map, Payload RAM, ALU, Narrow?)

Execution stage must allow both operands to be read from the payload RAM
It already supports one immediate operand
Sign extension between the payload RAM and the ALU input
Narrow-checking logic to verify whether the operands are narrow
Narrow datapath back to the map table
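
The sign-extension step and the narrow check can be illustrated on raw register images. This is a sketch under assumed parameters (a 7-bit inline field and a 64-bit datapath); the function names are hypothetical and no claim is made about the actual circuit.

```python
def sign_extend(field: int, bits: int = 7, width: int = 64) -> int:
    """Expand a `bits`-wide two's-complement field to a `width`-bit register image."""
    field &= (1 << bits) - 1               # keep only the stored bits
    if field & (1 << (bits - 1)):          # sign bit set: the value is negative
        field -= 1 << bits
    return field & ((1 << width) - 1)      # wrap into the full register width


def narrow_check(reg_image: int, bits: int = 7, width: int = 64) -> bool:
    """A value is narrow if truncating to `bits` and re-extending reproduces it."""
    full = reg_image & ((1 << width) - 1)
    return sign_extend(full, bits, width) == full


minus_one = sign_extend(0b1111111)         # all ones in the 7-bit field
```

Here `minus_one` is the 64-bit all-ones pattern, so the ALU sees the same value it would have read from the register file.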

13
Modifications to Map Table

Registers are freed from both the retire/wb stage and the commit stage
Must tolerate duplicate deallocations of the same physical register
Once as narrow, again when the next writer commits
Map entries need to be writable from both the rename stage and the retire/wb stage
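
One way to make a free list tolerate duplicate deallocations is to keep it as a bit vector rather than a queue of register numbers. The slides do not say how the free list is built, so this is an assumption for illustration only, and `FreeList` is a hypothetical name. The sketch does not handle a register that is reallocated between the two releases.

```python
class FreeList:
    """Free list kept as a bit vector, so releasing a free register again is harmless."""

    def __init__(self, num_physical: int):
        self.free_bits = [False] * num_physical

    def release(self, preg: int) -> None:
        # Setting an already-set bit changes nothing. A FIFO of register
        # numbers would instead hold the register twice and hand it out twice.
        self.free_bits[preg] = True

    def allocate(self) -> int:
        preg = self.free_bits.index(True)   # raises ValueError if none is free
        self.free_bits[preg] = False
        return preg


fl = FreeList(4)
fl.release(2)            # freed once because its value was narrow
fl.release(2)            # freed again when the next writer commits
first = fl.allocate()    # register 2 is handed out only once
```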

14
Stale Pointer Problem
(Figure: structures that hold pointers into the PRF. Labels: MAP, Checkpoints, PRF, copy, ROB, IssueQ)

Deallocating physical registers early makes these pointers stale
Equivalent to the garbage collection problem
Two choices:
Delay deallocation until no valid pointers remain (refcount)
Update all pointers (ideal IPC)

15
Map table checkpoints problem

Map table checkpoints need to be updated when a narrow operand is written
Lazy update
Complex, but not cycle-time critical
Checkpoint reference counting
Similar to Akkary et al.
Delays deallocation, which slightly reduces the IPC benefit
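
The checkpoint reference counting option can be sketched as a counter per physical register that tracks how many live checkpoints still point at it. This is a simplified model, not the scheme of Akkary et al. or the one evaluated in the talk; the names are hypothetical, and a checkpoint is assumed to be a plain list of the physical register numbers it references.

```python
class CheckpointRefCount:
    """Delay freeing a physical register while any map checkpoint points at it."""

    def __init__(self, num_physical: int):
        self.refs = [0] * num_physical
        self.pending_free = set()   # inlined registers waiting for checkpoints to die
        self.freed = []             # registers returned to the free list, in order

    def take_checkpoint(self, snapshot) -> None:
        for preg in snapshot:
            self.refs[preg] += 1

    def drop_checkpoint(self, snapshot) -> None:
        for preg in snapshot:
            self.refs[preg] -= 1
            if self.refs[preg] == 0 and preg in self.pending_free:
                self.pending_free.discard(preg)
                self.freed.append(preg)

    def inline(self, preg: int) -> None:
        """Called when preg's narrow value has been moved into the map table."""
        if self.refs[preg] == 0:
            self.freed.append(preg)       # no checkpoint holds a stale pointer
        else:
            self.pending_free.add(preg)   # deallocation is delayed


c = CheckpointRefCount(8)
snap = [1, 2, 3]
c.take_checkpoint(snap)
c.inline(2)                 # referenced by the checkpoint: held back
c.inline(5)                 # unreferenced: freed at once
c.drop_checkpoint(snap)     # the checkpoint retires, releasing register 2
```

The checkpoints themselves are never rewritten, which is why this option is simpler than lazy update, at the cost of holding some registers longer.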

16
Example of WAR Violation
Load p1 <- MEM[p7]
And p2 <- p3, p4 (narrow)
Add p5 <- p1, p2
WAR violation: p2 is freed early and reallocated to the Or, which can write it before the Add has read it
Or p2 <- p8, p9

Rare, but frequent enough to affect performance
Needs an efficient solution

17
Rename Table WAW Hazards
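(Figure: instructions r3 <- r1, r2 renamed to p4 <- p1, p2 and p5 <- p1, p2, moving through Fetch, Decode, Execute, Retire, and Commit; the MAP entry for r3 goes from p3 to p4 to p5, the ROB (Dst) entries hold p4 and p5, and the conflicting write is marked "WAW!")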

WAW hazards
A narrow value could be written to a map entry that has since been remapped
Must ensure that the map entry has not been remapped before writing
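
A minimal sketch of the check described above: the narrow write goes ahead only if the map entry still points at the producer's physical register. The tagged-tuple map representation and the name `write_narrow` are assumptions made for illustration.

```python
def write_narrow(map_table: dict, logical: str, producer_preg: int, value: int) -> bool:
    """Inline `value` only if `logical` is still mapped to the producer's register."""
    if map_table[logical] != ("ptr", producer_preg):
        return False          # remapped by a younger writer: skip the write
    map_table[logical] = ("val", value)
    return True


# r3 was first renamed to p4, then remapped to p5 by a younger writer.
map_table = {"r3": ("ptr", 5)}
older_ok = write_narrow(map_table, "r3", producer_preg=4, value=1)    # rejected
younger_ok = write_narrow(map_table, "r3", producer_preg=5, value=2)  # accepted
```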

18
Integrating PRI with Early Deallocation

Not all operands are narrow
Reduces register lifetime further
Adds unmap flags and complete flags (Moudgill et al. 1993)

(Figure: width = 4; series: baseline, PRI, PRI+ER)
19
Machine Model

4-wide fetch, issue, commit
512-entry ROB, 256-entry LSQ
32-entry scheduler
64 physical registers
Speculative scheduling with selective recovery
Combined bimodal branch predictor
32KB IL1, 32KB DL1, 512KB L2
7-bit PRI for integer, 1-bit PRI for FP

20
Speed Up for Integer Benchmarks

PRI (checkpoint reference counting) performs substantially better than previous work
The checkpoint reference counting scheme performs close to the ideal case (ideal lazy)
Combining PRI and ER increases performance further

21
PRF Occupancy for Int. Benchmarks

PRI reduces register file pressure more than the previous work (ER)
Combining PRI and ER reduces the pressure further

22
Speed Up for FP Benchmark

Ammp benchmark -> physical registers are not the performance bottleneck
Art benchmark -> a lot of narrow operands to exploit
Wupwise benchmark -> few narrow operands

23
Conclusion

PRI can lead to substantial performance improvement for both integer and FP benchmarks
Ideal update of stale pointers provides only marginal benefit
Checkpoint reference counting is the best choice

24
Future Work

Interaction of PRI with delayed register allocation (virtual physical registers) (Gonzalez et al. 1998)
Interaction of PRI with software-based techniques to deallocate dead registers
PRI enables a binary-compatible mechanism for the compiler to tell the hardware that a register is dead
The compiler can simply insert a load immediate of a narrow value into any register that it knows is dead

25
Questions?