Physical Register Inlining (PRI)

1
Physical Register Inlining (PRI)
  • Mikko H. Lipasti (1), Brian Mestan (2), and Erika
    Gunadi (1)
  • (1) Department of Electrical and Computer
    Engineering
  • University of Wisconsin-Madison
  • (2) IBM Microelectronics
  • IBM Corporation, Austin, TX

http://www.ece.wisc.edu/pharm
2
Demand for Large Register Files
Instruction Window
  • Deeper Pipeline
  • Increasing pressure on Register File
  • Lots of attention / prior work

3
Challenges with Scaling Register Files
  • Additional pipe stages needed for access
  • Increases branch misprediction penalty
  • Increases scheduling misprediction penalty
  • Requires additional bypass logic
  • Further increases pipeline depth
  • Increases the demand for more registers

4
Physical Register Lifetime
[Figure: physical register lifetime, for width = 4 and width = 8]
  • Managed inefficiently

5
Prior Work
  • Register file caching [Swenson et al. 1988,
    Zalamea et al. 2000, Postiff et al. 2001, Cruz et
    al. 2000, Borch et al. 2002]
  • Late allocation [Gonzalez et al. 1998, Monreal et
    al. 1999]
  • Efficient management
  • Early deallocation [Moudgill et al. 1993]
  • Program semantics [Martin et al. 1997, Lo et al.
    1999]
  • Checkpointing [Martinez et al. 2002, Akkary et
    al. 2003]
  • Value-based optimizations [Jourdan et al. 1998]

6
Early Deallocation
  • Moudgill et al. 1993
  • Focused on last read to release
  • Avoid waiting for the next writer to commit
  • Deallocate a register as soon as all of the
    following hold:
  • Complete (complete flag)
  • Unmapped (unmap flag)
  • No outstanding readers (reference counter)
  • Still requires next writer to enter the window
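The three release conditions above can be sketched as a small predicate. This is only an illustration of the idea, not the mechanism from Moudgill et al.; the class and field names (`PhysRegState`, `complete`, `unmapped`, `readers`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Per-register bookkeeping (hypothetical field names)."""
    complete: bool = False   # producer has written its result
    unmapped: bool = False   # a later writer has renamed the same logical register
    readers: int = 0         # consumers that have not yet read the value


def can_release(reg: PhysRegState) -> bool:
    """A register may be freed once all three conditions hold."""
    return reg.complete and reg.unmapped and reg.readers == 0


reg = PhysRegState()
reg.complete = True          # value produced
reg.readers = 1              # one consumer still pending
reg.unmapped = True          # next writer has entered the window
still_live = can_release(reg)
reg.readers -= 1             # last consumer reads the value
now_free = can_release(reg)
```

The `unmapped` flag is what ties release to the next writer entering the window: until that happens the predicate stays false even for a value nobody will read again.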

7
Physical Register Inlining
  • Exploits narrow operands: a sizable fraction of
    operands can be stored in fewer than 8 bits
    [Canal et al. 2000]
  • Often fewer bits than needed to specify a
    physical register
  • Store the value instead of the pointer
  • Stores narrow values in map table
  • Reduces physical register lifetime
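A minimal sketch of the idea, assuming a map entry that holds either a physical register number or the value itself. The 7-bit width is taken from the machine model slide; the names `MapEntry`, `is_narrow`, and `writeback` are hypothetical, and the sketch ignores the stale-pointer and remapping issues covered on later slides.

```python
from dataclasses import dataclass

NARROW_BITS = 7  # assumed inline width for integer values


def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if value fits in a signed two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


@dataclass
class MapEntry:
    inlined: bool   # True: payload is the value; False: payload is a physical register number
    payload: int


def writeback(entry: MapEntry, result: int, free_list: list) -> MapEntry:
    """Inline a narrow result into the map entry and free its register early."""
    if not entry.inlined and is_narrow(result):
        free_list.append(entry.payload)       # register no longer has to hold the value
        return MapEntry(inlined=True, payload=result)
    return entry                              # wide result: keep the pointer


free_list = []
narrow_entry = writeback(MapEntry(False, 12), 5, free_list)
wide_entry = writeback(MapEntry(False, 13), 1000, free_list)
```

Here the register behind the narrow result is returned to the free list at writeback rather than when the next writer commits, which is where the shorter lifetime comes from.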

8
Operand Significance
  • The paper also has an FP graph; FP inlining
    exploits the values 0.0/1.0 (54)

9
Outline
  • Motivation
  • Prior Work
  • Physical Register Inlining
  • Quick Microarchitectural Review
  • Modifications Needed
  • PRI + early deallocation
  • Experiments
  • Conclusions

10
Microarchitectural Review
  • Register Rename/Map Tables
  • Maps logical names to physical names
  • Removes false name dependences
  • Two common types: RAM and CAM
  • CAM map is positional
  • Not suitable for storing values
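The difference between the two organizations can be sketched as follows. This reflects the usual textbook layout rather than any specific design; the table sizes and mappings are made up for illustration.

```python
# RAM map: one entry per logical register; the stored data is the physical register number.
ram_map = [3, 7, 5]                      # logical r0 -> p3, r1 -> p7, r2 -> p5


def ram_lookup(logical: int) -> int:
    return ram_map[logical]              # direct index; the entry contents are the pointer


# CAM map: one entry per physical register; each holds a logical tag and a valid bit.
cam_map = [(None, False)] * 8
cam_map[3] = (0, True)
cam_map[7] = (1, True)
cam_map[5] = (2, True)


def cam_lookup(logical: int) -> int:
    # Associative search: the answer is the position of the matching entry.
    for phys, (tag, valid) in enumerate(cam_map):
        if valid and tag == logical:
            return phys
    raise KeyError(logical)
```

In the CAM form the physical register number is never stored; it is implied by which entry matched, so there is no pointer field that a narrow value could overwrite. The RAM form does store the pointer, which is the field PRI reuses.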

[Figure: RAM map (indexed by logical register, entries hold physical register numbers) and CAM map (indexed by physical register, entries hold logical register tags)]
11
Microarchitectural Review
  • Allocating and Freeing Physical Registers
  • Allocates a physical register at decode; the map
    table entry is updated
  • Releases the physical register when the next
    writer is committed
  • Checkpoint and Recovery of Register Map
  • Optimization to reduce branch misprediction
    penalty
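The conventional allocate and release policy described above can be sketched as below. The `Renamer` class and its fields are hypothetical, and checkpointing and misprediction recovery are left out.

```python
class Renamer:
    """Conventional scheme: allocate at decode, free when the next writer commits."""

    def __init__(self, num_logical: int, num_physical: int):
        self.map = {r: r for r in range(num_logical)}      # initial identity mapping
        self.free = list(range(num_logical, num_physical))
        self.rob = []                                      # (logical dst, previous physical reg)

    def decode(self, dst: int) -> int:
        prev = self.map[dst]
        new = self.free.pop(0)        # allocate at decode
        self.map[dst] = new           # update the map table entry
        self.rob.append((dst, prev))  # remember the old mapping for release at commit
        return new

    def commit(self) -> int:
        _, prev = self.rob.pop(0)     # oldest instruction commits
        self.free.append(prev)        # previous mapping of its destination is now dead
        return prev


r = Renamer(num_logical=4, num_physical=6)
p_a = r.decode(dst=2)        # first writer of logical register 2
p_b = r.decode(dst=2)        # next writer of logical register 2
freed_first = r.commit()     # frees the register that was mapped before p_a
freed_second = r.commit()    # next writer commits: only now is p_a freed
```

The register allocated to the first writer stays occupied until the second writer commits, even if its value was consumed long before.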

12
Modifications to Data Flow
[Figure: pipeline with stages Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit; labeled structures: Map, Payload RAM, ALU, and a "Narrow?" check]
  • Execution stage must allow both operands to be
    read from payload RAM
  • Already supports one immediate operand
  • Sign extension between payload RAM and the ALU
    input
  • Narrow checking logic to verify if the operands
    are narrow
  • Narrow datapath back to the map table
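The sign extension and narrow check mentioned above can be sketched at the bit level. The function names are hypothetical and the 7-bit width in the usage lines is the integer width from the machine model slide.

```python
def truncate(value: int, bits: int) -> int:
    """Keep only the low `bits` bits, as the narrow field would store them."""
    return value & ((1 << bits) - 1)


def sign_extend(field: int, bits: int) -> int:
    """Recover the full signed value from a `bits`-wide two's-complement field."""
    sign_bit = 1 << (bits - 1)
    return (field ^ sign_bit) - sign_bit


def round_trips(value: int, bits: int) -> bool:
    """Narrow check: the value survives truncation followed by sign extension."""
    return sign_extend(truncate(value, bits), bits) == value


stored = truncate(-3, 7)           # what the narrow field would hold
restored = sign_extend(stored, 7)  # what the ALU input would see
```

A value that does not round-trip has significant bits above the narrow field and has to stay in a physical register.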

13
Modifications to Map Table
  • Registers freed from the retire/wb stage and
    commit stage
  • Tolerant of duplicate deallocations of the same
    physical register
  • Once as narrow, again when the next writer
    commits
  • Map entries need to be writable from rename stage
    and retire/wb stage
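One way to make deallocation tolerant of duplicates is a free bit per register instead of a queue of register numbers. This is an assumed design for illustration, not necessarily the one used in the paper, and `FreeList` is a hypothetical name. It only covers a duplicate release of a register that is still free; if the register were reallocated between the two releases, the second one would have to be suppressed by other means not shown here.

```python
class FreeList:
    """Free list that tolerates releasing the same register twice (sketch)."""

    def __init__(self, num_physical: int):
        self.is_free = [False] * num_physical   # one bit per physical register

    def release(self, reg: int) -> None:
        self.is_free[reg] = True                # setting an already-set bit is harmless

    def allocate(self) -> int:
        reg = self.is_free.index(True)          # raises ValueError if nothing is free
        self.is_free[reg] = False
        return reg

    def count(self) -> int:
        return sum(self.is_free)


fl = FreeList(4)
fl.release(2)                 # freed once when the narrow value is inlined
fl.release(2)                 # freed again when the next writer commits
free_after_duplicate = fl.count()
reused = fl.allocate()
```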

14
Stale Pointer Problem
[Figure: structures that hold physical register pointers (MAP, Checkpoints, ROB, IssueQ) alongside the PRF; one label reads "copy"]
  • Deallocating physical registers early makes these
    pointers stale
  • Equivalent to the garbage collection issue
  • Two choices
  • Delay deallocation until no valid pointers
    remain (refcount)
  • Update all pointers (ideal IPC)

15
Map table checkpoints problem
  • Map table checkpoints need to be updated when a
    narrow operand is written
  • Lazy update
  • Complex, but not cycle time critical
  • Checkpoint reference counting
  • Similar to Akkary et al.
  • Delays deallocation, reduces IPC benefit slightly
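Checkpoint reference counting can be sketched as follows: a register whose value has been inlined is freed only once no checkpoint still points at it. The class and method names are hypothetical and the sketch tracks checkpoints only, not the ROB or issue queue.

```python
class CheckpointRefCount:
    """Defer freeing a register while any map checkpoint still points at it (sketch)."""

    def __init__(self):
        self.refs = {}          # physical register -> number of checkpoints holding it
        self.pending = set()    # registers inlined as narrow but still referenced
        self.free = []

    def take_checkpoint(self, mapped_regs):
        for reg in mapped_regs:
            self.refs[reg] = self.refs.get(reg, 0) + 1
        return list(mapped_regs)

    def release_checkpoint(self, checkpoint):
        for reg in checkpoint:
            self.refs[reg] -= 1
            if self.refs[reg] == 0 and reg in self.pending:
                self.pending.discard(reg)
                self.free.append(reg)

    def inline_narrow(self, reg):
        """Called when reg's value has been moved into the map table."""
        if self.refs.get(reg, 0) == 0:
            self.free.append(reg)       # no checkpoint references: free immediately
        else:
            self.pending.add(reg)       # delay until the last checkpoint is released


rc = CheckpointRefCount()
cp = rc.take_checkpoint([4, 5])
rc.inline_narrow(4)             # referenced by cp, so deallocation is delayed
rc.inline_narrow(6)             # unreferenced, freed at once
free_before = list(rc.free)
rc.release_checkpoint(cp)       # last reference to 4 goes away
free_after = list(rc.free)
```

The delay between `inline_narrow(4)` and `release_checkpoint(cp)` is the cost the slide refers to: the register is dead but cannot be reused yet.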

16
Example of WAR Violation
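Load p1 <- MEM[p7]
And  p2 <- p3, p4    (result is narrow, so p2 is freed early)
Add  p5 <- p1, p2    (still waiting on the load, holds a pointer to p2)
Or   p2 <- p8, p9    (reuses p2: WAR violation against the Add)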
  • Rare, but frequent enough to affect performance
  • Must have efficient solution

17
Rename Table WAW Hazards
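[Figure: two in-flight instructions r3 <- r1, r2, renamed to p4 <- p1, p2 and p5 <- p1, p2, shown across Fetch, Decode, Execute, Retire, Commit; the MAP entry for r3 moves from p3 to p4 to p5, and the ROB (Dst) column records the same sequence; the late narrow write from the p4 instruction is marked "WAW!"]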
  • WAW hazards
  • Writes narrow value to a remapped map entry
  • Must ensure that the map entry has not been
    remapped
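The check described above can be sketched as a guarded write. The tuple encoding of map entries (`"ptr"` or `"val"` plus a payload) and the function name are hypothetical; the register names follow the slide's example.

```python
def write_narrow(map_table: dict, logical: str, producer_preg: str, value: int) -> bool:
    """Inline `value` only if `logical` is still mapped to the producer's register."""
    if map_table.get(logical) != ("ptr", producer_preg):
        return False                       # entry was remapped by a younger writer: skip
    map_table[logical] = ("val", value)    # safe to replace the pointer with the value
    return True


map_table = {"r3": ("ptr", "p5")}                 # r3 has already been renamed from p4 to p5
stale = write_narrow(map_table, "r3", "p4", 3)    # older writer finishes with a narrow result
entry_after_stale = map_table["r3"]
fresh = write_narrow(map_table, "r3", "p5", 9)    # youngest writer finishes with a narrow result
```

Comparing against the producer's own physical register number is one simple way to detect remapping; the older writer's value is dropped from the map because younger readers of r3 must see the p5 result.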

18
Integrating PRI with Early Deallocation
  • Not all operands are narrow
  • Reduces register lifetime further
  • Adds unmap flags and complete flags [Moudgill et
    al. 1993]

[Figure: physical register lifetime at width = 4 for baseline, PRI, and PRI+ER]
19
Machine Model
  • 4-wide fetch, issue, commit
  • 512-entry ROB, 256-entry LSQ
  • 32-entry scheduler
  • 64 physical registers
  • Speculative scheduling with selective recovery
  • Combined bimodal branch predictor
  • 32KB IL1, 32KB DL1, 512KB L2
  • 7 bits PRI for integer, 1 bit PRI for FP

20
Speed Up for Integer Benchmarks
  • PRI (checkpoint reference counting) performs
    substantially better than previous work
  • The checkpoint reference counting scheme performs
    close to the ideal case (ideal lazy)
  • Combining PRI and ER increases performance
    further

21
PRF Occupancy for Int. Benchmarks
  • PRI reduces register file pressure more than
    previous work (ER)
  • Combining PRI and ER reduces the pressure further

22
Speed Up for FP Benchmark
  • Ammp benchmark -> physical registers are not the
    performance bottleneck
  • Art benchmark -> a lot of narrow operands to
    exploit
  • Wupwise benchmark -> few narrow operands

23
Conclusion
  • PRI can lead to substantial performance
    improvement for both integer and FP benchmarks
  • Ideal update of stale pointers provides marginal
    benefit
  • Checkpoint reference counting is the best choice

24
Future Work
  • Interaction of PRI with delayed register
    allocation (virtual physical registers) [Gonzalez
    et al. 1998]
  • Interaction of PRI with software-based techniques
    to deallocate dead registers
  • PRI enables a binary-compatible mechanism for the
    compiler to communicate the fact that a register
    is dead to the hardware
  • The compiler can simply insert a load immediate
    of a narrow value to any register that is dead

25
Questions?
  • Thank you

26
Machine Model