by Martin Labrecque - PowerPoint PPT Presentation


1
by Martin Labrecque
  • How to Fake 1000 Registers
  • Oehmke, Binkert, Mudge, Reinhart
  • to appear in Nov. @ MICRO 2005

2
Outline
  • Motivation
  • Observations on registers
  • Idea
  • Virtual Context Architecture
  • Evaluation in 2 types of applications

3
Some definitions
  • Activation record
  • Data structure
  • variables belonging to one particular scope
    (e.g. a procedure body)
  • links to other activation records
  • Synonyms "data frame", "stack frame"
  • Context
  • Activation record of a thread of execution

A register is only meaningful to the current
activation record
4
Key observation
  • Virtual Memory
  • From the ISA standpoint each process has an
    'infinite' amount of memory available
  • Memory is managed in caches, RAM and disk
  • Memory is context free
  • This is not true for registers
  • Limited resource

Need to virtualize registers
5
How registers are used
Source code variables
Compiler
IR virtual registers
Register allocation
Binary logical registers
Decode/Rename
Data path physical registers
Pipeline
6
Registers are useful
  • Can't get rid of registers
  • Efficient address encoding in instructions
  • Unambiguous data dependences
  • Efficient integration in the micro-architecture

7
Dawn of a New Idea
Attach a memory address to the content of the
register!
8
Virtualizing registers
9
Mapping registers to memory
  • Registers are virtualized because they hold the
    content of a memory location
  • 2 options
  • At register allocation, map compiler virtual
    registers to memory
  • Memory to memory operations
  • Doesn't make use of ISA registers
  • Map ISA registers to memory
  • Key Idea of the Virtual Context Architecture

10
Programming the VCA
  • Where are the registers mapped in memory?
  • The Stack Pointer is the Reference
  • Allows memory to be 'allocated' dynamically
  • Efficient way of passing parameters to a
    function
  • Need some architectural support to address with
    offsets to the stack pointer

11
Renaming
  • To get the register memory address, combine
  • the source/destination register index of the
    binary program
  • base pointer (stack pointer)
  • ISA register index → register memory address →
    physical register
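
The two-step translation above can be sketched in a few lines. This is only an illustration: the 8-byte register slot, the direction of the offset, and the names `register_address` and `RenameTable` are assumptions, not details taken from the slides.

```python
REG_BYTES = 8  # assumption: 64-bit registers, one 8-byte memory slot each


def register_address(base_pointer, reg_index):
    """Memory address backing a logical register: base pointer + offset."""
    return base_pointer + reg_index * REG_BYTES


class RenameTable:
    """Hypothetical rename table keyed by register memory address."""

    def __init__(self):
        self.addr_to_phys = {}  # register memory address -> physical register

    def install(self, base_pointer, reg_index, phys_reg):
        """Record that a physical register now holds this logical register."""
        self.addr_to_phys[register_address(base_pointer, reg_index)] = phys_reg

    def lookup(self, base_pointer, reg_index):
        """Return the physical register index, or None on a miss."""
        return self.addr_to_phys.get(register_address(base_pointer, reg_index))
```

Because the key is a memory address rather than a bare register index, the same ISA register index under two different base pointers names two different registers, which is what lets one table serve several contexts.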

12
Register memory address → physical reg.
  • The address = base pointer + offset
  • Exploit locality of the addresses to compress the
    number of bits in the conversion, low probability
    of capacity miss

13
Register File is a Cache
  • Hardware controlled cache
  • An instruction requires its source operands and
    destination register to execute

What happens on a cache miss? We need some
hardware control!
14
Some additional HW
  • Each register has 3 new attributes
  • A reference count
  • Incremented when instruction using it goes
    through rename
  • Decremented when instruction is committed
  • Nonzero value means that register cannot be
    reallocated to other logical registers
  • Guarantees correct instruction execution

15
Some additional HW (ctnd)
  • A 'committed' bit
  • Valid, non speculative value
  • A 'dirty' bit
  • Value more up-to-date than memory
  • Using those attributes, a state machine controls
    which registers are available or not
  • Branch recovery works by having a duplicate
    renaming table containing the committed
    architectural state
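
The three per-register attributes can be modelled as a small record. This is a minimal sketch under assumptions: the class name `PhysReg`, the method names, and the exact rules for when a register may be reallocated or must be spilled are illustrative, since the slides do not spell out the state machine.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    ref_count: int = 0       # in-flight instructions that use this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self):
        """An instruction using this register passed through rename."""
        self.ref_count += 1

    def on_commit(self, wrote_value):
        """An instruction using this register committed."""
        if self.ref_count == 0:
            raise ValueError("commit without a matching rename")
        self.ref_count -= 1
        if wrote_value:
            # Assumed rule: a committed write makes the value architectural
            # and newer than its memory copy.
            self.committed = True
            self.dirty = True

    def can_reallocate(self):
        """A nonzero reference count pins the register."""
        return self.ref_count == 0

    def needs_spill(self):
        """Assumed rule: only committed, dirty values are written back."""
        return self.committed and self.dirty
```

For example, a destination register is pinned from rename until commit; after commit it may be reallocated, but its value must first be spilled because memory is stale.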

16
Source operand to physical register conversion
17
Destination logical register to physical register
conversion
18
Allocation of an entry for destination register
  • Replacement policy in rename table

19
Pipeline modifications
  • Changes in the renaming
  • ASTQ: architectural state transfer queue
  • Adds to the queue upon fills and spills
  • Has priority on the instruction to execute
  • Addresses for fills and spills are pre-calculated
  • No memory disambiguation required
  • No data dependences
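
To make the "register file is a cache" idea concrete, here is a rough sketch of fills and spills being queued on misses. It assumes an LRU replacement policy (the slides do not name one), ignores reference counts by assuming no instructions are in flight, and performs the memory transfer immediately instead of modelling queue timing. `RegisterCache` and its fields are hypothetical names.

```python
from collections import OrderedDict, deque


class RegisterCache:
    """Physical register file modelled as a cache over register addresses."""

    def __init__(self, num_phys, memory):
        self.num_phys = num_phys
        self.memory = memory            # backing store: address -> value
        self.entries = OrderedDict()    # address -> [value, dirty], LRU first
        self.transfer_queue = deque()   # queued ("fill" | "spill", address)

    def _make_room(self):
        if len(self.entries) >= self.num_phys:
            victim, (value, dirty) = self.entries.popitem(last=False)
            if dirty:
                # Only values newer than memory need to be written back.
                self.transfer_queue.append(("spill", victim))
                self.memory[victim] = value

    def read(self, addr):
        """Source operand: a miss triggers a fill from memory."""
        if addr not in self.entries:
            self._make_room()
            self.transfer_queue.append(("fill", addr))
            self.entries[addr] = [self.memory.get(addr, 0), False]
        self.entries.move_to_end(addr)
        return self.entries[addr][0]

    def write(self, addr, value):
        """Destination register: overwritten whole, so no fill is needed."""
        if addr not in self.entries:
            self._make_room()
        self.entries[addr] = [value, True]
        self.entries.move_to_end(addr)
```

Note that the fill and spill addresses are known as soon as the miss is detected, which matches the point above that they need no memory disambiguation.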

20
Outline
  • Motivation
  • Observations on registers
  • Idea
  • Virtual Context Architecture
  • Evaluation in 2 types of applications
  • Baseline Methodology
  • Register windows w/ results
  • SMT w/ results
  • Combined register windows + SMT

21
Baseline machine
22
More on methodology
  • Uses SimPoints to find representative simulation
    intervals
  • SPEC CPU 2000
  • Baseline doesn't have register windows
  • (Alpha's register remapping with issue queues)
  • Window overflow/underflow: 10 cycles

23
Applications
  • Register windows
  • Multithreading

http://www.sics.se/psm/sparcstack.html
http://en.wikipedia.org/wiki/Register_window
24
Register Windows
  • Global register allocation
  • How many registers should we reserve for the
    current procedure versus the rest of the program?
  • SPARC example
  • usually contains as many as 128 GPRs
  • At any point only 32 are available
  • 8 global, 8 params in, 8 params out, 8 local
    values
  • Up to 32 windows
  • Windows changed by an instruction usually along
    with 'call' and 'return'
  • Partial overlap: 'params out' of caller are
    'params in' of callee
  • Also used in Itanium (variable sized window)
  • Alternative is e.g. renaming with reservation
    stations
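
The window overlap can be illustrated with a simplified index mapping. Assumptions: 8 windows by default, windows advance upward on a call (real SPARC decrements its window pointer on `save`), and the physical layout is invented for illustration. The logical numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in) follows the SPARC convention.

```python
NUM_GLOBALS = 8
NEW_REGS_PER_WINDOW = 16  # 8 in + 8 local; the 8 out belong to the next window


def physical_index(window, logical, num_windows=8):
    """Map (window, logical register 0..31) to a physical register index."""
    if not 0 <= logical < 32:
        raise ValueError("logical register must be in 0..31")
    if logical < 8:                       # globals: shared by every window
        return logical
    pool = NEW_REGS_PER_WINDOW * num_windows
    if logical < 16:                      # outs r8-r15: the callee's ins
        offset = NEW_REGS_PER_WINDOW * (window + 1) + (logical - 8)
    elif logical < 24:                    # locals r16-r23: private
        offset = NEW_REGS_PER_WINDOW * window + 8 + (logical - 16)
    else:                                 # ins r24-r31
        offset = NEW_REGS_PER_WINDOW * window + (logical - 24)
    return NUM_GLOBALS + offset % pool    # windowed pool is circular
```

With this mapping the caller's r8 in window 0 and the callee's r24 in window 1 land on the same physical register, so a parameter is passed without touching memory. The modulo shows why the pool eventually wraps: the outs of the last window collide with the ins of the first, which is the overflow case.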

Save some memory (stack) traffic on function calls
25
Register Windows Caveats
  • Problem
  • Overflow of windows: call depth too deep
  • Underflow of window: need to restore a window
    from memory
  • Solution
  • Operating system handler
  • typical scheme saves and restores windows
  • VCA handles registers individually
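
A toy model of the conventional scheme shows how overflow and underflow traps accumulate with call depth. It is a sketch under assumptions: one whole window is spilled or refilled per trap, no window is reserved for the handler, and the trace is well formed (never more returns than calls). The 10-cycle cost is the figure quoted in the methodology slide; `count_window_traps` is a hypothetical name.

```python
TRAP_CYCLES = 10  # window overflow/underflow cost from the methodology slide


def count_window_traps(trace, num_windows):
    """Count (overflows, underflows) for a trace of 'C' calls and 'R' returns."""
    resident = 1                # windows currently held in the register file
    overflows = underflows = 0
    for event in trace:
        if event == "C":
            if resident == num_windows:
                overflows += 1  # oldest window is spilled to make room
            else:
                resident += 1
        elif event == "R":
            if resident == 1:
                underflows += 1  # caller's window must be refilled
            else:
                resident -= 1
        else:
            raise ValueError("trace may only contain 'C' and 'R'")
    return overflows, underflows


def trap_cycles(trace, num_windows):
    """Total cycles spent in window traps under the assumed fixed cost."""
    return sum(count_window_traps(trace, num_windows)) * TRAP_CYCLES
```

Three nested calls followed by three returns cause two overflows and two underflows with 2 windows, and none with 4 windows.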

Performance Advantage of the Register Stack in
Intel Itanium Processors
26
Register windows evaluation
  • Ideal: fills and spills are free
  • VCA is especially good with few registers
  • Close to ideal at 256 registers
  • VCA 4% faster than baseline @ 256 regs
  • Fewer registers means fewer in-flight instructions
    and fewer branch mispredictions → increase
  • For others → decrease

27
Single data cache port experiment
  • Normalized to 2-port baseline
  • 7% faster than baseline @ 256 regs
  • 0.5% slower than ideal @ 256 regs

28
2nd App: multi-threading
29
SMT: simultaneous multi-threading
  • Lots of replicated resources (larger register
    file)
  • VCA renaming table is not replicated, only base
    thread pointer
  • VCA
  • # of in-flight instructions determines number of
    registers required
  • not # of threads

30
SMT: 2 and 4 threads
  • Normalized to single thread baseline @ 256 regs
    (not shown)
  • @ 192 regs, VCA 2T is at 97% of baseline @ 320 regs
    (baseline is at 88%)
  • @ 192 regs, VCA 4T is at 98.7% of baseline @ 448
    regs
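
The baseline register counts quoted above (256, 320, 448 for 1, 2, 4 threads) are consistent with adding 64 registers per extra thread. That per-thread figure is an inference from the numbers on the slides, not something they state, so the sketch below treats it as an assumption.

```python
def baseline_regs(threads, single_thread_regs=256, logical_per_thread=64):
    """Assumed sizing rule: each extra thread adds its own logical registers."""
    return single_thread_regs + logical_per_thread * (threads - 1)


def vca_fraction(vca_regs, threads):
    """VCA register count as a fraction of the assumed baseline size."""
    return vca_regs / baseline_regs(threads)
```

Under this rule `baseline_regs(2)` is 320 and `baseline_regs(4)` is 448, so a 192-register VCA uses 60% of the 2-thread baseline's registers and under half of the 4-thread baseline's.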

31
Combined: SMT w/ register windows
  • Normalized to single thread baseline @ 256 regs
  • VCA 4T: 98% of peak performance @ 192 regs

32
SMT + register windows
  • Register windows reduce cache accesses while SMT
    increases them
  • VCA 4T non-windowed @ 192 regs reaches 98% of
    baseline performance but still has 24% more cache
    accesses; adding windows brings cache accesses 5%
    below baseline

33
VCA summarized
  • unifies support for both multiple independent
    threads and register windowing within each
    thread
  • backwards compatible with existing ISAs at the
    application level for multithreaded contexts
  • requires only minimal ISA changes for register
    windowing
  • requires no changes to the physical register file
    design and the performance-critical
    schedule/execute/writeback loop
  • builds on existing rename logic to map logical
    registers to physical registers and handles
    register cache misses in the decode/rename stages

34
VCA summarized (ctnd)
  • completely decouples physical register file size
    from the number of logical registers by using
    memory as a backing store, rather than another
    larger register file
  • does not involve speculation or prediction,
    avoiding the need for recovery mechanisms.

35
Conclusions
  • A VCA-based implementation of register windows in
    an out-of-order processor reduces execution time
    by 4% while reducing data cache accesses by
    nearly 20% compared to a non-windowed machine,
    with an even larger performance advantage over a
    conventional register-window implementation.
  • VCA's data cache traffic reduction is large
    enough that it can achieve the same performance
    with one cache port as an otherwise similar
    conventional machine would with two cache ports.

36
Conclusions (ctnd)
  • VCA is also able to manage thread contexts
    efficiently, enabling effective implementation of
    simultaneous multithreading (SMT) using as few as
    half the registers of a standard architecture.
  • VCA allows SMT to be combined with register
    windows with no additional physical registers.
  • a 4-thread VCA machine with 192 registers can
    achieve higher performance than a conventional
    non-windowed SMT machine with twice as many
    registers.