by Martin Labrecque

1
by Martin Labrecque

How to Fake 1000 Registers
Oehmke, Binkert, Mudge, Reinhart
to appear in Nov @ MICRO 2005

2
Outline

Motivation
Observations on registers
Idea
Virtual Context Architecture
Evaluation in 2 types of applications

3
Some definitions

Activation record
Data structure
variables belonging to one particular scope
(e.g. a procedure body)
links to other activation records
Synonyms "data frame", "stack frame"
Context
Activation record of a thread of execution

A register is only meaningful to the current
activation record
4
Key observation

Virtual Memory
From the ISA standpoint, each process has an
'infinite' amount of memory available
Memory is managed in caches, RAM and disk
Memory is context free
This is not true for registers
Limited resource

Need to virtualize registers
5
How registers are used
Source code: variables
Compiler
IR: virtual registers
Register allocation
Binary: logical registers
Decode/Rename
Data path: physical registers
Pipeline
6
Registers are useful

Can't get rid of registers
Efficient address encoding in instructions
Unambiguous data dependences
Efficient integration in the micro-architecture

7
Dawn of a New Idea
Attach a memory address to the content of the
register!
8
Virtualizing registers
9
Mapping registers to memory

Registers are virtualized because they hold the
content of a memory location
2 options
At register allocation, map compiler virtual
registers to memory
Memory to memory operations
Doesn't make use of ISA registers
Map ISA registers to memory
Key Idea of the Virtual Context Architecture

10
Programming the VCA

Where are the registers mapped in memory?
The Stack Pointer is the Reference
Allows memory to be 'allocated' dynamically
Efficient way of passing parameters to a
function
Need some architectural support to address with
offsets to the stack pointer

11
Renaming

To get the register memory address, combine
the source/destination register index from the
binary program
the base pointer (stack pointer)
ISA register index → register memory address →
physical register

12
Register memory address → physical reg.

The address = base pointer + offset
Exploit locality of the addresses to compress the
number of bits in the conversion; low probability
of capacity miss
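
As a minimal sketch of the two-step translation on slides 11 and 12, the Python below forms a register memory address from a base pointer and an ISA register index, then looks that address up in a small rename table. The 8-byte register size, the table layout, and every name (`REG_BYTES`, `register_address`, `RenameTable`) are assumptions made for illustration, not details from the talk.

```python
REG_BYTES = 8  # assumed register width in bytes (hypothetical)


def register_address(base_pointer, reg_index):
    """Memory address backing an ISA register: base pointer + offset."""
    return base_pointer + reg_index * REG_BYTES


class RenameTable:
    """Maps register memory addresses to physical register numbers."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # unallocated physical registers
        self.map = {}                          # address -> physical register

    def lookup(self, address):
        """Return (physical register, hit). A miss allocates a free register."""
        if address in self.map:
            return self.map[address], True
        if not self.free:
            # A real design would pick a victim and spill it; not modelled here.
            raise RuntimeError("no free physical register")
        phys = self.free.pop(0)
        self.map[address] = phys
        return phys, False


table = RenameTable(num_physical=4)
addr_r3 = register_address(base_pointer=0x1000, reg_index=3)
first = table.lookup(addr_r3)                       # miss, allocates a register
second = table.lookup(addr_r3)                      # hit, same register
other = table.lookup(register_address(0x2000, 3))   # same index, other context
```

Because the table is keyed by address rather than by register index, the same ISA register in two different contexts lands in two different physical registers.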

13
Register File is a Cache

Hardware controlled cache
An instruction requires its source operands and
destination register to execute

What happens on a cache miss? We need some
hardware control!
14
Some additional HW

Each register has 3 new attributes
A reference count
Incremented when instruction using it goes
through rename
Decremented when instruction is committed
Non zero value means that register cannot be
reallocated to other logical registers
Guarantees instruction correct execution

15
Some additional HW (ctnd)

A 'committed' bit
Valid, non speculative value
A 'dirty' bit
Value more up-to-date than memory

Using those attributes, a state machine controls
which registers are available or not
Branch recovery works by having a duplicate
renaming table containing the committed
architectural state
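
The three per-register attributes from slides 14 and 15 can be modelled roughly as follows. This is a sketch under assumptions: the slides say a state machine uses these attributes but do not spell out its transitions, so the reuse policy in `can_reallocate` and the names `PhysReg`, `on_rename`, `on_commit` and `needs_spill` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    ref_count: int = 0       # in-flight instructions that use this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self):
        """An instruction using this register passes through rename."""
        self.ref_count += 1

    def on_commit(self, wrote_value):
        """An instruction using this register commits."""
        self.ref_count -= 1
        if wrote_value:
            self.committed = True
            self.dirty = True

    def can_reallocate(self):
        # Assumed policy: reuse only unreferenced registers with committed values.
        return self.ref_count == 0 and self.committed

    def needs_spill(self):
        # A dirty value must be written back to memory before the register is reused.
        return self.dirty


reg = PhysReg()
reg.on_rename()                    # a writer enters rename
busy = reg.can_reallocate()        # still referenced
reg.on_commit(wrote_value=True)    # the writer commits
idle = reg.can_reallocate()
spill = reg.needs_spill()
```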

16
Source operand to physical register conversion
17
Destination logical register to physical register
conversion
18
Allocation of an entry for destination register

Replacement policy in rename table

19
Pipeline modifications

Changes in the renaming
ATSQ: architectural state transfer queue
Entries are added to the queue upon fills and spills
Has priority over the instruction to execute
Addresses for fills and spills are pre-calculated
No memory disambiguation required
No data dependences
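
A rough software model of the queue behaviour listed above: fills and spills are enqueued with their addresses already known, and a pending transfer is served before the waiting instruction. The class layout, the `use_memory_port` helper and the dictionary-based memory are all hypothetical; the slides do not describe the hardware at this level.

```python
from collections import deque


class ATSQ:
    """FIFO of pending register fills and spills (hypothetical model)."""

    def __init__(self):
        self.pending = deque()

    def add_spill(self, address, phys_reg):
        self.pending.append(("spill", address, phys_reg))

    def add_fill(self, address, phys_reg):
        self.pending.append(("fill", address, phys_reg))

    def step(self, memory, regfile):
        """Perform the oldest transfer; return False if the queue is empty."""
        if not self.pending:
            return False
        kind, address, phys_reg = self.pending.popleft()
        if kind == "spill":
            memory[address] = regfile[phys_reg]   # register -> memory
        else:
            regfile[phys_reg] = memory[address]   # memory -> register
        return True


def use_memory_port(atsq, memory, regfile, instruction):
    """Give queued transfers priority over the waiting instruction."""
    if atsq.step(memory, regfile):
        return "transfer"
    instruction()
    return "instruction"


memory = {0x1018: 7}
regfile = [42, 0]
atsq = ATSQ()
atsq.add_spill(0x2000, 0)   # evict physical register 0 to memory
atsq.add_fill(0x1018, 0)    # then reuse it for the register stored at 0x1018
log = [use_memory_port(atsq, memory, regfile, lambda: None) for _ in range(3)]
```

Since each entry carries a precomputed address and entries are served in order, the model needs no address comparison between transfers.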

20
Outline

Motivation
Observations on registers
Idea
Virtual Context Architecture
Evaluation in 2 types of applications
Baseline Methodology
Register windows w/ results
SMT w/ results
Combined register windows + SMT

21
Baseline machine
22
More on methodology

Uses SimPoints to find representative simulation
intervals
SPEC CPU 2000
Baseline doesn't have register windows
(Alpha's register remapping with issue queues)
Window overflow/underflow: 10 cycles

23
Applications

Register windows
Multithreading

http://www.sics.se/psm/sparcstack.html
http://en.wikipedia.org/wiki/Register_window
24
Register Windows

Global register allocation
How many registers should we reserve for the
current procedure versus the rest of the program?
SPARC example
usually contains as many as 128 GPRs
At any point only 32 are available
8 global, 8 params in, 8 params out, 8 local
values
Up to 32 windows
Windows changed by an instruction usually along
with 'call' and 'return'
Partial overlap: 'params out' of caller are
'params in' of callee
Also used in Itanium (variable sized window)
Alternative is e.g. renaming with reservation
stations

Save some memory (stack) traffic on function calls
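
The overlap between caller and callee windows can be sketched as an index calculation. This is a simplified model, assuming SPARC's logical numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in), a hypothetical count of 8 windows, and a window index that grows by one on each call; `NUM_WINDOWS` and `physical_index` are illustrative names, and details such as the direction in which real hardware moves its window pointer are ignored.

```python
NUM_WINDOWS = 8          # assumed window count for this sketch
GLOBALS = 8              # r0-r7 are shared by every window
PER_WINDOW = 16          # each window adds 8 'in' + 8 'local' slots


def physical_index(window, logical):
    """Map (window, logical register 0..31) to a physical register index."""
    if logical < 8:                        # globals: same in every window
        return logical
    if logical < 16:                       # outs: the next window's ins
        slot = ((window + 1) % NUM_WINDOWS) * PER_WINDOW + (logical - 8)
    elif logical < 24:                     # locals: private to this window
        slot = window * PER_WINDOW + 8 + (logical - 16)
    else:                                  # ins: start of this window's slots
        slot = window * PER_WINDOW + (logical - 24)
    return GLOBALS + slot


caller_out0 = physical_index(0, 8)     # caller's first 'params out' register
callee_in0 = physical_index(1, 24)     # callee's first 'params in' register
caller_local0 = physical_index(0, 16)
callee_local0 = physical_index(1, 16)
total = GLOBALS + NUM_WINDOWS * PER_WINDOW
```

In this model the caller's out registers and the callee's in registers resolve to the same physical index, so parameters are passed without copying.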
25
Register Windows Caveats

Problem
Overflow of windows: call depth too deep
Underflow of window: need to restore a window
from memory
Solution
Operating system handler
typical scheme saves and restores windows
VCA handles registers individually
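
To make the overflow and underflow cases concrete, the sketch below counts them for a call/return trace on a machine that keeps a fixed number of whole windows in registers. The trace format and the function name are assumptions. The 10-cycle cost is the overflow/underflow figure from the methodology slide, but applying it once per event is this sketch's own simplification.

```python
TRAP_CYCLES = 10  # per overflow/underflow, as quoted on the methodology slide


def count_window_traps(trace, num_windows):
    """Return (overflows, underflows) for a list of 'call'/'return' events."""
    resident = 1      # windows currently held in registers
    saved = 0         # windows spilled to memory
    overflows = underflows = 0
    for event in trace:
        if event == "call":
            if resident == num_windows:
                overflows += 1      # no free window: oldest one goes to memory
                saved += 1
            else:
                resident += 1
        elif event == "return":
            if resident > 1:
                resident -= 1
            elif saved > 0:
                underflows += 1     # caller's window must come back from memory
                saved -= 1
    return overflows, underflows


trace = ["call"] * 5 + ["return"] * 5
overflows, underflows = count_window_traps(trace, num_windows=4)
penalty = (overflows + underflows) * TRAP_CYCLES
```

This models only the conventional whole-window scheme; it does not model VCA, which handles registers individually.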

Performance Advantage of the Register Stack in
Intel Itanium Processors
26
Register windows evaluation

Ideal: fills and spills are free
VCA is especially good with few registers
Close to ideal at 256 registers
VCA 4% faster than baseline @ 256 regs

Fewer registers means fewer in-flight instructions
and fewer branch mispredictions → increase
For others → decrease

27
Single data cache port experiment

Normalized to 2-port baseline
7% faster than baseline @ 256 regs
0.5% slower than ideal @ 256 regs

28
2nd App: multi-threading
29
SMT: simultaneous multi-threading

Lots of replicated resources (larger register
file)
VCA renaming table is not replicated, only base
thread pointer
VCA
# of in-flight instructions determines the number
of registers required
not the # of threads

30
SMT 2 and 4 threads

Normalized to single thread baseline @ 256 regs
(not shown)
@ 192 regs, VCA 2T is at 97% of baseline @ 320 regs
(baseline is at 88%)
@ 192 regs, VCA 4T is at 98.7% of baseline @ 448
regs

31
Combined: SMT w/ register windows

Normalized to single thread baseline @ 256 regs
VCA 4T: 98% of peak performance @ 192 regs

32
SMT + register windows

Register windows reduce cache accesses while SMT
increases them
VCA 4T non-windowed @ 192 regs reaches 98% of
baseline performance but still has 24% more cache
accesses; adding windows brings cache accesses 5%
below baseline

33
VCA summarized

unifies support for both multiple independent
threads and register windowing within each
thread
backwards compatible with existing ISAs at the
application level for multithreaded contexts
requires only minimal ISA changes for register
windowing
requires no changes to the physical register file
design and the performance-critical
schedule/execute/writeback loop
builds on existing rename logic to map logical
registers to physical registers and handles
register cache misses in the decode/rename stages

34
VCA summarized (ctnd)

completely decouples physical register file size
from the number of logical registers by using
memory as a backing store, rather than another
larger register file
does not involve speculation or prediction,
avoiding the need for recovery mechanisms.

35
Conclusions

A VCA-based implementation of register windows in
an out-of-order processor reduces execution time
by 4% while reducing data cache accesses by
nearly 20% compared to a non-windowed machine,
with an even larger performance advantage over a
conventional register-window implementation.
VCA's data cache traffic reduction is large
enough that it can achieve the same performance
with one cache port as an otherwise similar
conventional machine would with two cache ports.

36
Conclusions (ctnd)

VCA is also able to manage thread contexts
efficiently, enabling effective implementation of
simultaneous multithreading (SMT) using as few as
half the registers of a standard architecture.
VCA allows SMT to be combined with register
windows with no additional physical registers.
A 4-thread VCA machine with 192 registers can
achieve higher performance than a conventional
non-windowed SMT machine with twice as many
registers.
