Title: Register Renaming
1Register Renaming / Value Prediction
2Overview
- Need for Post-RISC
- Register Renaming vs. Allocation Strategies
- How to compile for Post-RISC machines
- Dynamic Register Renaming through Virtual-Physical Registers
3Software Outlives Hardware
- How to make old software run faster?
- Faster CPU clock and memory hierarchy
- Adapt CPUs to actual software (profiling/tuning)
- More instructions per cycle
- Today's software will run on tomorrow's CPUs
- Need to keep software interface stable
- More functional units and registers
4Compile-time vs. Run-time
- Little is known about software at compile-time
- Space/time trade-offs
- Memory speeds cannot keep up with CPU speeds
- When to apply optimizations that increase code size
5Solutions
- New scalable architecture (IA-64)
- Decouple physical/virtual registers using register windows
- More explicit parallelism allows for more function units
- Explicit speculative instructions
- Post-RISC architecture
- Remove limits in super-scalar implementation of existing architectures
- Extract even more parallelism out of existing software
6Anti- and Output Dependencies
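- True data dependencies are also called read-after-write (RAW) hazards
- An instruction may use a result produced by the previous instruction
- Both instructions may not execute simultaneously in multiple pipelines.
- The second instruction must typically be stalled.
- Anti- (write-after-read, WAR) and output (write-after-write, WAW) dependencies, by contrast, arise only from reuse of a register name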
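The three hazard kinds can be told apart mechanically. A minimal sketch, assuming a hypothetical `Instr` record with one destination and a tuple of sources (these names are illustrative, not from the slides):

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, any number of sources."""
    dest: Optional[str]
    srcs: Tuple[str, ...]


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the hazards that `second` has against the earlier `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")  # true dependency: second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")  # anti-dependency: second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")  # output dependency: both write the same name
    return found


load = Instr(dest="rA", srcs=("Mem1",))
add = Instr(dest="rA", srcs=("rA", "rA"))
print(sorted(hazards(load, add)))  # ['RAW', 'WAW']
```

Only the RAW entry reflects a real flow of data; WAR and WAW disappear once each write gets its own register.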
7Structural Dependencies
- Stalls result in less than optimal performance
- We may have single-issue cycles, which process only a single instruction.
- Worse, we may have zero-issue cycles, which initiate no new instructions.
- Data dependencies can also limit performance for a scalar machine
- Two cycle memory load/write
- Intra-instruction dependencies
8Scheduling
- Scheduling can remove stalls
- Intra-instruction dependencies cannot be removed by scheduling (CISC)
9Need for Post-RISC
- Super-scalar has diminishing returns in IPC (instructions per clock)
- 2-Way → 1.6 - 1.8 (85%)
- 4-Way → 2.6 (65%)
- 8-Way → ???
- More parallelism needed
- Look beyond set of 4 instructions
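The figures above read most naturally as sustained instructions per clock on an N-way machine, with the number in parentheses as a percentage of the peak issue rate. Under that reading (an assumption, as is taking 1.7 as the midpoint of the 2-way range), the arithmetic is:

```python
def utilization(ipc: float, width: int) -> float:
    """Percentage of the peak issue rate actually sustained."""
    return 100.0 * ipc / width


# (issue width, sustained instructions per clock); 1.7 is an assumed midpoint.
for width, ipc in [(2, 1.7), (4, 2.6)]:
    print(f"{width}-way: {utilization(ipc, width):.0f}% of peak issue rate")
```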
10Post-RISC characteristics
- Out-of-order execution
- (Existed 20 years ago on IBM and CDC)
- Innovative for single-chip
- Branch history bits
- Precise interrupts
- Fetch/Flow Prediction
- More caching
- Instruction cache becomes CPU scratch space
- Register renaming
- First in IBM 360/91 FPU
11SPECint92 Trends
- SPECint92 numbers are increasing
- DEC has historically been the champ
- SPECint92/Clock rates
- DEC low (21164 @ 300 MHz: 1.14, 10/95)
- IBM strong early (580H @ 55 MHz: 1.76, 9/93)
- HP (PA-8000 @ 133 MHz: 2.7, 10/95)
12The Post-RISC Architecture
13Post-RISC CPUs
- Traditional RISC
- DEC Alpha 21164
- Sun UltraSPARC-1
- (partially) Post-RISC
- PowerPC 604
- MIPS R10000
- HP PA-8000
- Intel Pentium Pro
- DEC Alpha 21264
- HAL SPARC64
14Automatic Register Renaming
- Every register write allocates a new physical register R
- The register name A is an alias for the last R allocated by a write to A
- An instruction reading and writing a register allocates a new R too
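The rule above can be sketched in a few lines. This is an illustration under assumptions: instructions are hypothetical `(dest, srcs)` pairs, names starting with `r` are registers, and the supply of physical registers is unbounded.

```python
from itertools import count


def is_register(name):
    """Assumption for this sketch: register names start with 'r'."""
    return name.startswith("r")


def rename(program):
    """Rename a list of (dest, srcs) pairs so every register write gets a fresh R."""
    fresh = count()  # unbounded supply of physical register numbers (simplification)
    alias = {}       # logical name -> last physical register allocated by a write
    out = []
    for dest, srcs in program:
        # Sources are read through the current alias before the destination is
        # remapped, so "rA <- rA + rA" reads the old R and writes a new one.
        new_srcs = tuple(alias.get(s, s) for s in srcs)
        if is_register(dest):
            phys = f"R{next(fresh)}"
            alias[dest] = phys
            dest = phys
        out.append((dest, new_srcs))
    return out


program = [
    ("rA", ("Mem1",)),
    ("rA", ("rA", "rA")),
    ("Mem2", ("rA",)),
    ("rA", ("Mem3",)),
    ("rA", ("rA", "1")),
    ("Mem4", ("rA",)),
]
for dest, srcs in rename(program):
    print(dest, "<-", srcs)
```

After renaming, the load from `Mem3` writes `R2` and no longer conflicts with the earlier uses of `rA`.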
15Advantages over More ISA Registers
- Smaller instructions
- Allow same software to run on range of implementations
- Compare the same program running on Pentium or AMD Ath
- Less state to save
- Faster function calls
- Faster context switches
- Life-times can be optimized
16Renaming Implementation
- Rename Storage Locations
- Reorder Buffer
- Physical Register File
- Similarities
- Allocate at decode
- Release at commit
17Renaming using Reorder buffer
- Results are kept in reorder buffer
- Source operands are read either from
- the register file, or
- a reorder buffer entry
- Not-yet-ready results are forwarded to instruction queue
- Used by Intel Pentium III, PowerPC 604, SPARC64
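A toy model of the operand lookup described above. The class and field names are hypothetical, and the buffer is an ever-growing list rather than a circular queue; real designs differ in many details.

```python
class ReorderBufferRenamer:
    """Toy model: results live in the reorder buffer until they commit."""

    def __init__(self, committed):
        self.regfile = dict(committed)  # architectural (committed) values
        self.rob = []                   # entries in program order
        self.head = 0                   # index of the oldest uncommitted entry
        self.latest = {}                # logical register -> newest ROB index writing it

    def read(self, reg):
        """Locate a source operand: register file, ROB entry, or a tag to wait on."""
        idx = self.latest.get(reg)
        if idx is None:
            return ("value", self.regfile[reg])  # no in-flight writer
        entry = self.rob[idx]
        if entry["done"]:
            return ("value", entry["value"])     # result already in the ROB
        return ("tag", idx)                      # not ready: wait for forwarding

    def dispatch(self, dest, srcs):
        """Allocate a ROB entry at decode; return its index and operand sources."""
        operands = [self.read(s) for s in srcs]  # read before remapping dest
        self.rob.append({"dest": dest, "value": None, "done": False})
        idx = len(self.rob) - 1
        self.latest[dest] = idx
        return idx, operands

    def complete(self, idx, value):
        """Record a finished result in its ROB entry."""
        self.rob[idx].update(value=value, done=True)

    def commit(self):
        """Retire the oldest entry: copy its result to the register file."""
        entry = self.rob[self.head]
        if not entry["done"]:
            raise RuntimeError("oldest instruction has not completed")
        self.regfile[entry["dest"]] = entry["value"]
        if self.latest.get(entry["dest"]) == self.head:
            del self.latest[entry["dest"]]
        self.head += 1
```

An instruction dispatched while its producer is still executing receives a `("tag", index)` pair and picks up the value when that entry completes.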
18Renaming on Pentium III
- All registers can be renamed (general-purpose, floating-point, status)
- Renaming uses a reorder buffer with 40 entries
- FPU control/status cannot be renamed
- Max 2 renamings per instruction
19Register Allocation Example
- Minimal number of named registers
- Scheduling is limited
- Strictly serial execution
rA <- Mem1
rA <- rA + rA
Mem2 <- rA
rA <- Mem3
rA <- rA + 1
Mem4 <- rA
Mem2 = Mem1 + Mem1; Mem4 = Mem3 + 1
20Renaming using Physical Register File
- Register file contains more registers than defined in ISA (logical registers)
- Map logical register to physical registers during decode
- Operands are always read from the physical register file
- Used by MIPS R10000 and DEC 21264
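A minimal sketch of a map table plus free list, allocating at decode and releasing at commit. The class name, method names and initial mapping are assumptions for illustration, not the R10000 or 21264 design.

```python
from collections import deque


class PhysicalRegisterRenamer:
    """Toy map-table renamer backed by a single physical register file."""

    def __init__(self, logical_regs, num_physical):
        if num_physical <= len(logical_regs):
            raise ValueError("need more physical than logical registers")
        # Assumed initial state: logical register i lives in physical register i.
        self.map = {reg: i for i, reg in enumerate(logical_regs)}
        self.free = deque(range(len(logical_regs), num_physical))

    def decode(self, dest, srcs):
        """Translate sources, then give the destination a fresh physical register."""
        phys_srcs = [self.map[s] for s in srcs]  # read old mapping first
        if not self.free:
            raise RuntimeError("stall: no free physical register")
        new = self.free.popleft()
        old = self.map[dest]  # previous mapping, to be kept with the instruction
        self.map[dest] = new
        return new, phys_srcs, old

    def commit(self, old):
        """When the renaming instruction commits, the previous register is dead."""
        self.free.append(old)
```

The previous mapping returned by `decode` is what a reorder buffer entry would hold so that the register can be released at commit or restored on a roll back.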
21Virtual-Physical Registers
- Motivation: better utilization of physical registers
- Important in presence of long latency instructions
- Conventional scheme wastes register for each
- Decoded instruction that has not finished execution
- Committed instruction whose result is dead
- Can be eliminated by maintaining reference counter
Example:
load f2,0(r6)
fdiv f2,f2,f10
fmul f2,f2,f12
fadd f2,f2,1
22Virtual-Physical Register Renaming
- General Map Table
- Indexed by logical register L
- VP register: last virtual-physical register that L has been mapped to
- P register: last physical register that L and VP have been mapped to
- V-bit: indicates whether P is valid
- Physical Map Table
- Has entry for each VP
- Contains last physical register that VP has been mapped to
23Functional Description
- For each logical source register S do a GMT lookup
- If V-bit is set, rename S to P
- Otherwise, rename S to VP
- Rename the logical destination register to a new VP
- Update GMT: set VP to new mapping and reset V
- Save previous VP in reorder buffer to be able to roll back
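The rename step above, as a sketch. `GMTEntry` and `VPRenamer` are hypothetical names; VP tags come from an unbounded counter and every logical register is assumed to start out with a valid physical register, both simplifications.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class GMTEntry:
    vp: int           # last virtual-physical register mapped to this logical register
    p: Optional[int]  # last physical register, meaningful only when v is set
    v: bool           # V-bit: is p valid?


class VPRenamer:
    def __init__(self, logical_regs):
        self._next_vp = count()
        # Assumed initial state: each logical register already holds a value.
        self.gmt = {
            reg: GMTEntry(vp=next(self._next_vp), p=phys, v=True)
            for phys, reg in enumerate(logical_regs)
        }

    def rename(self, dest, srcs):
        """Return (new destination VP, renamed sources, previous VP for roll back)."""
        renamed = []
        for s in srcs:
            entry = self.gmt[s]
            # V-bit set: the value exists, use P. Otherwise wait on the VP tag.
            renamed.append(("P", entry.p) if entry.v else ("VP", entry.vp))
        entry = self.gmt[dest]
        prev_vp = entry.vp             # goes into the reorder buffer entry
        entry.vp = next(self._next_vp)
        entry.p = None                 # no physical register until completion
        entry.v = False
        return entry.vp, renamed, prev_vp
```

Note that no physical register is consumed here; the destination only receives a tag.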
24Functional Description
- Instruction Queue Fields
- Operation code
- Destination VP
- Source operands
- Ready-bits for source operands; when ready, a source operand contains a physical register number
- Reorder Buffer Entry
- Destination logical register
- Completion bit
- VP mapping of last instruction with same logical destination
25Functional Description
- When source operands are ready, instruction is issued
- When instruction completes
- new physical register R is allocated for result
- PMT is updated to reflect new mapping
- VP number of destination is broadcast to all entries in instruction queue with physical register identifier
- GMT is updated: entry corresponding to logical destination is checked for match with the VP and, if so, the physical register number is copied to the P register field and the V flag is set
- As a result, a new instruction using same logical register will find corresponding physical register in GMT
- Lastly, C flag of entry in reorder buffer is set
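The completion steps above, sketched with plain dictionaries. All field names (`vp`, `p`, `v`, `srcs`, `ready`, `c`) are illustrative assumptions, and the case where no physical register is free at completion is left out.

```python
def complete(vp, logical_dest, gmt, pmt, free_list, queue, rob_entry):
    """Run the completion steps for the instruction whose destination tag is `vp`."""
    p = free_list.pop()  # allocate a physical register only now
    pmt[vp] = p          # PMT: VP -> physical register
    # Broadcast (VP, P) to the instruction queue: waiting sources become ready.
    for instr in queue:
        for i, (kind, num) in enumerate(instr["srcs"]):
            if kind == "VP" and num == vp:
                instr["srcs"][i] = ("P", p)
                instr["ready"][i] = True
    # GMT: update only if this VP is still the latest mapping of the register.
    entry = gmt[logical_dest]
    if entry["vp"] == vp:
        entry["p"] = p
        entry["v"] = True
    rob_entry["c"] = True  # mark the reorder buffer entry complete
    return p
```

If a younger instruction has already renamed the same logical register, the GMT check fails and only the instruction queue and PMT see the new physical register.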
26Register Allocation Example
- Uses more named registers
- Scheduling more effective
- 2-way super-scalar execution
rA <- Mem1       rB <- Mem3
rA <- rA + rA    rB <- rB + 1
Mem2 <- rA       Mem4 <- rB
Mem2 = Mem1 + Mem1; Mem4 = Mem3 + 1
27Effect of Register Renaming
- Schedule uses 4 hardware registers
- 2-way super-scalar execution
rA1 <- Mem1        rB1 <- Mem3
rA2 <- rA1 + rA1   rB2 <- rB1 + 1
Mem2 <- rA2        Mem4 <- rB2
28Effect of Register Renaming
- Schedule uses 4 hardware registers
- Can hide memory-write latency
- Still no full use of multiple pipelines
rA1 <- Mem1
rA2 <- rA1 + rA1
Mem2 <- rA2
rA3 <- Mem3
rA4 <- rA3 + 1
Mem4 <- rA4
29Renaming and O-O-O execution
- Instructions wait for
- Availability of execution unit
- Input dependencies
- Older instructions have priority
- Load instructions have priority
- Instructions do NOT wait for
- Program order
- Branch resolution
- Output dependencies
- (use rename register)
30Renaming and O-O-O execution
- Schedule uses 4 hardware registers
- Can hide memory-write latency
- Even a bad (serial) schedule uses both pipelines
- Only one register name used
rA1 <- Mem1
rA2 <- rA1 + rA1
Mem2 <- rA2
rA3 <- Mem3
rA4 <- rA3 + 1
Mem4 <- rA4
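To see why renaming lets an out-of-order core use both pipelines on serially scheduled code, the sketch below counts issue cycles for the single-name sequence and for its renamed form. The model is an assumption: unit latency, a greedy oldest-first issue of up to `width` instructions per cycle, every hazard treated as a full ordering constraint, and `+` standing in for the arithmetic operation.

```python
def schedule(program, width=2):
    """Count cycles needed to issue a list of (dest, srcs) pairs out of order."""
    n = len(program)
    waits = []
    for j, (dj, sj) in enumerate(program):
        blockers = set()
        for i in range(j):
            di, si = program[i]
            # RAW, WAR or WAW by name against any earlier instruction.
            if di in sj or dj in si or di == dj:
                blockers.add(i)
        waits.append(blockers)
    issued = {}  # instruction index -> cycle in which it issued
    cycle = 0
    while len(issued) < n:
        ready = [
            j for j in range(n)
            if j not in issued
            and all(i in issued and issued[i] < cycle for i in waits[j])
        ]
        for j in ready[:width]:  # older instructions have priority
            issued[j] = cycle
        cycle += 1
    return cycle


one_name = [
    ("rA", ("Mem1",)), ("rA", ("rA", "rA")), ("Mem2", ("rA",)),
    ("rA", ("Mem3",)), ("rA", ("rA", "1")), ("Mem4", ("rA",)),
]
renamed = [
    ("rA1", ("Mem1",)), ("rA2", ("rA1", "rA1")), ("Mem2", ("rA2",)),
    ("rA3", ("Mem3",)), ("rA4", ("rA3", "1")), ("Mem4", ("rA4",)),
]
print(schedule(one_name), schedule(renamed))  # 6 3
```

With one name every instruction is chained to its predecessor; after renaming only the two three-instruction RAW chains remain and they run side by side.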
31Renaming-aware scheduling?
- Use Register Renaming in allocator
- minimal number of named registers
- maximal number of register instances
- Do not do scheduling that CPU can do
- over-scheduling can be worse than no scheduling at all