IA64 Register Model: Stack

1
IA-64 Register Model: Stack and Rotation
  • Dale Morris
  • Architect
  • Hewlett Packard Co.

2
Philosophy
  • Large register files
    • Most processors have lots of registers
  • Explicit control over register-renaming
    • Most processors have register renaming
    • IA-64 makes the register names SW-visible and makes
      the renaming explicit

3
Outline
  • Register Stack
  • Register Stack Engine
  • Register Rotation
  • Loop Branches
  • Modulo-Scheduling of Loops
  • Summary

4
Register Stack
  • Motivation
  • Automatic save/restore of GRs on procedure
    call/return
  • Cache traffic reduction
  • Latency hiding of register spill/fill

5
General Registers
6
GR Stack Frame
size of frame (sof)
size of locals (sol)
Current Frame Marker (CFM)
7
GR Stack Frame - Example
8
GR Stack Frame - Call
9
GR Stack Frame - Allocate
inputs
10
GR Stack Frame - Return
11
Instructions
  • br.call
    • Copies CFM to PFM
    • Creates new frame with only output regs
    • Saves local regs from previous frame
  • alloc
    • Resizes current frame
    • Saves PFM to a GR

12
Instructions (cont.)
  • mov to PFS
    • Restores PFM from a GR
  • br.ret
    • Restores CFM from PFM
    • Restores local regs for previous frame
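
The frame bookkeeping done by br.call, alloc, mov to PFS and br.ret can be sketched as a toy model. This is a minimal sketch, not the architected behaviour: the class and method names are hypothetical, only sof and sol are tracked, and the stacked registers are modelled as an unbounded array so that no spilling is needed.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    sof: int = 0  # size of frame
    sol: int = 0  # size of locals (inputs + locals)


class RegisterStack:
    """Toy model (hypothetical names): r32.. of the current frame index a flat array."""

    def __init__(self):
        self.bof = 0        # storage index that the current frame's r32 maps to
        self.cfm = Frame()  # current frame marker
        self.pfm = Frame()  # previous frame marker
        self.regs = {}      # storage index -> value

    def _index(self, r):
        if not 32 <= r < 32 + self.cfm.sof:
            raise IndexError(f"r{r} is outside the current frame")
        return self.bof + (r - 32)

    def write(self, r, value):
        self.regs[self._index(r)] = value

    def read(self, r):
        return self.regs[self._index(r)]

    def br_call(self):
        # Copy CFM to PFM; the new frame holds only the caller's output regs.
        self.pfm = Frame(self.cfm.sof, self.cfm.sol)
        self.bof += self.cfm.sol
        self.cfm = Frame(sof=self.cfm.sof - self.cfm.sol, sol=0)

    def alloc(self, inputs, local_count, outputs):
        # Resize the current frame; hand back PFM so the caller can keep it in a GR.
        saved = Frame(self.pfm.sof, self.pfm.sol)
        self.cfm = Frame(sof=inputs + local_count + outputs,
                         sol=inputs + local_count)
        return saved

    def mov_to_pfs(self, saved):
        self.pfm = Frame(saved.sof, saved.sol)

    def br_ret(self):
        # Restore CFM from PFM; the caller's locals become addressable again.
        self.cfm = Frame(self.pfm.sof, self.pfm.sol)
        self.bof -= self.cfm.sol


rs = RegisterStack()
rs.alloc(0, 2, 1)            # procA: r32-r33 local, r34 output
rs.write(32, "A-local")
rs.write(34, "arg")
rs.br_call()                 # procB starts with A's output as its r32
seen_in_b = rs.read(32)
pfs_b = rs.alloc(1, 1, 0)    # procB: 1 input, 1 local
rs.write(33, "B-local")
rs.mov_to_pfs(pfs_b)
rs.br_ret()
seen_in_a = rs.read(32)
```

In this sketch the caller's output register r34 becomes the callee's r32 without any copy, which is the renaming the slides describe.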

13
Leaf Procedure Optimization
  • No need to save/restore PFM
  • Can always use scratch static GRs
  • Can omit alloc if
    • Not many registers needed
    • Register rotation not needed

14
Register Stack Engine
  • Automatically spills/fills registers from memory
    as needed
  • Registers saved on a Backing Store Stack
  • Spills/fills NaT bits as well

15
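Reg Stack Backing Store
[Figure: A calls B calls C. Two columns, "Physical stacked registers" and "Backing Store", show the frames of procA's ancestors, procA (sol_a), procB (sol_b) and procC (sof_c, the current frame), plus unallocated space. RSE loads/stores move registers between the two columns; arrows mark the call and return directions.]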
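
A rough way to see when the RSE has to touch memory is to model the physical stacked registers as a fixed-size window over the call stack. The sketch below is an illustration under simplifying assumptions, not the real engine: the class and method names are hypothetical, frames do not overlap, and spills and fills happen eagerly at call and return, whereas real hardware may perform them in the background.

```python
from collections import deque


class RegisterStackEngine:
    """Toy RSE (hypothetical names): oldest registers spill when the file is full."""

    def __init__(self, physical_regs=96):
        self.physical_regs = physical_regs
        self.resident = deque()   # registers held in hardware, oldest first
        self.backing_store = []   # spilled registers in memory, oldest first
        self.frames = []          # (procedure, frame size) in call order
        self.spills = 0
        self.fills = 0

    def call(self, proc, size):
        if size > self.physical_regs:
            raise ValueError("frame larger than the physical register file")
        self.frames.append((proc, size))
        for i in range(size):
            if len(self.resident) == self.physical_regs:
                # No free physical register: spill the oldest one to memory.
                self.backing_store.append(self.resident.popleft())
                self.spills += 1
            self.resident.append((proc, i))

    def ret(self):
        _, size = self.frames.pop()
        for _ in range(size):
            self.resident.pop()
        if self.frames:
            _, caller_size = self.frames[-1]
            # Fill until the caller's whole frame is back in registers.
            while len(self.resident) < caller_size:
                self.resident.appendleft(self.backing_store.pop())
                self.fills += 1


rse = RegisterStackEngine(physical_regs=8)  # tiny file so that spilling is visible
rse.call("procA", 4)
rse.call("procB", 3)
rse.call("procC", 4)   # 11 registers requested, 8 available
spills_after_calls = rse.spills
rse.ret()              # back in procB: still fully resident
rse.ret()              # back in procA: part of its frame must be filled
```

With a register file large enough for all three frames, the same call sequence causes no memory traffic at all, which is the "avoids register spill when few needed" case.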
16
Register Stack Summary
  • Exposes register renaming to SW
  • Avoids register spill when few needed
  • Hides register spill/fill
  • Programmable sizes
    • only use as many registers as you need

17
Outline
  • Register Stack
  • Register Stack Engine
  • Register Rotation
  • Loop Branches
  • Modulo-Scheduling of Loops
  • Summary

18
Register Rotation
  • Motivation
  • pipeline-schedule loops onto HW
  • remove extraneous work from loop
  • minimize start-up overhead
  • small code footprint
  • maximum computational throughput with few
    instructions

19
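GR Stack Frame w/ Rotation
[Figure: GRs 0-31 are static; GRs 32-127 are stacked. Within the frame, starting at r32, the figure marks the Size of Rotating (sor), the locals (sol) and the outputs (sof).]
Current Frame Marker (CFM) fields: sof, sol, sor, rrb.gr, rrb.fr, rrb.pr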
20
GR Rotation
  • Size of rotating region is a multiple of 8
  • Rotating region overlays current frame
    • Starts at r32
  • Overlay allows rotation and stack renaming in a
    single level of adders
  • Must copy input registers before loop

21
FR Rotation
[Figure: FRs 0-31 are static; FRs 32-127 rotate.]
Upper 3/4 of register file rotates
22
Predicate Rotation
[Figure: PRs 0-15 are static; PRs 16-63 rotate.]
Upper 3/4 of register file rotates
23
Register Rotation RRB
  • Separate Rotating Register Base for each of GRs,
    FRs, PRs
  • Loop branches decrement all register rotating
    bases (RRB)
  • Instructions contain a virtual register number
  • RRB + virtual register number = physical register
    number (wrapping within the rotating region)

[Animation, slides 23-27: a figure with the words "Palm", "Springs", "is", "Sunny" in the rotating registers. This slide shows registers 32-36 with RRB = 0.]
24
Register Rotation RRB
[Animation frame: registers 32-36 shown, RRB = 0]
25
Register Rotation RRB
[Animation frame: registers 35, 34, 33, 32, 127 shown, RRB = -1]
26
Register Rotation RRB
[Animation frame: registers 34, 33, 32, 127, 126 shown, RRB = -2]
27
Register Rotation RRB
[Animation frame: registers 33, 32, 127, 126, 125 shown, RRB = -3]
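
The virtual-to-physical mapping can be written out as modular addition. This is a minimal sketch: the function name and defaults are assumptions chosen for illustration (a rotating region of 96 registers starting at 32, as for the FRs; for GRs the size would be sor, for PRs 48 registers starting at 16), and RRB is kept as a negative number as on the slides.

```python
def physical_reg(virtual, rrb, first=32, size=96):
    """Map a virtual register number to a physical one (hypothetical helper)."""
    if not first <= virtual < first + size:
        return virtual  # static registers are not renamed
    return first + (virtual - first + rrb) % size


# Write one word into virtual register 32 per step, decrementing RRB in between.
regs = {}
rrb = 0
for word in ["Palm", "Springs", "is", "Sunny"]:
    regs[physical_reg(32, rrb)] = word
    rrb -= 1
rrb += 1  # look at the state right after the last write (RRB = -3)

# Each earlier value has moved up by one register name per rotation.
names_now = {v: regs.get(physical_reg(v, rrb)) for v in range(32, 36)}
```

A value written to register 32 is visible as register 33 after one rotation, so a loop can hand results from one iteration to the next without copy instructions.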
28
Loop Branches
  • br.cloop uses LC for simple, non-pipelined loops
    • decrements LC and loops until LC is 0
  • br.ctop uses LC and EC for pipelined counted
    loops
  • br.wtop uses branch predicate and EC for
    pipelined while loops
  • br.cexit, br.wexit used for unrolled, pipelined
    loops

29
br.ctop
  • Function (simplified)
      if (LC > 0)      { LC--; pr63 = 1; rrb--; loop }
      else if (EC > 1) { EC--; pr63 = 0; rrb--; loop }
      else             { EC--; pr63 = 0; rrb--; fall_through }
  • LC counts main loop iterations
  • EC counts pipeline stages for drain
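
The simplified function above can be transcribed directly to see how LC, EC and RRB evolve. This is a sketch of the slide's simplified rule only; the function name and the dictionary used to hold the state are illustrative assumptions, and corner cases of the real instruction (such as EC already being 0) are not modelled.

```python
def br_ctop(state):
    """Apply the simplified br.ctop rule; return True if the branch is taken."""
    if state["LC"] > 0:
        state["LC"] -= 1
        state["p63"] = 1      # after rotation this bit is read as p16
        taken = True
    elif state["EC"] > 1:
        state["EC"] -= 1
        state["p63"] = 0
        taken = True
    else:
        state["EC"] -= 1
        state["p63"] = 0
        taken = False
    state["RRB"] -= 1         # every case rotates the registers
    return taken


# 4 source iterations (LC = iterations - 1) through a 4-stage pipeline.
state = {"LC": 3, "EC": 4, "RRB": 0, "p63": 0}
trace = []
while True:
    taken = br_ctop(state)
    trace.append((state["LC"], state["EC"], state["RRB"]))
    if not taken:
        break
```

The loop body runs 7 times here: 4 passes while LC counts down, then 3 more while EC drains the remaining pipeline stages.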

30
Software Pipelining
  • Overlapping execution of different loop iterations

vs.
  • More iterations in same amount of time

31
Software Pipelining
  • Traditional architectures use loop unrolling
    • High overhead: extra code for loop body,
      prologue, and epilogue
  • Synergistic use of IA-64 features
    • Full Predication
    • Special branches
    • Register rotation removes loop copy overhead
    • Predicate rotation removes prologue and epilogue

Especially Useful for Integer Code With Small
Number of Loop Iterations
32
Pipelined Loop Example
  • DAXPY inner loop
    • dy[i] = dy[i] + (da * dx[i])
    • 2 loads, 1 fma, 1 store / iteration
  • Machine assumptions
    • can do 2 loads, 1 store, 1 fma, 1 br / cycle
    • load latency of 2 clocks
    • fma latency of 1 clock

33
Example Pipeline
  • Each column represents 1 source iteration

load dx,dy
tmp = dy + da * dx
store dy
34
Example Code
        .rotf dx[3], dy[3], tmp[2]
        mov ar.lc = 3                  // iterations-1
        mov ar.ec = 4                  // stages
        mov pr.rot = 0x10000
looptop:
  (p16) ldfd dx[0] = [dxsp], 8
  (p16) ldfd dy[0] = [dysp], 8
  (p18) fma.d tmp[0] = da, dx[2], dy[2]
  (p19) stfd [dydp] = tmp[1], 8
        br.ctop looptop
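
To check that the rotating names line up, the loop above can be simulated with a rotating FR file and rotating predicates. This is a hedged sketch rather than a cycle-accurate model: the function name is hypothetical, the register offsets assume .rotf assigns dx, dy and tmp consecutively from the start of the rotating region, each pass of the loop is treated as one cycle, and at least one element is assumed.

```python
def daxpy_pipelined(da, dx, dy):
    """Simulate the software-pipelined DAXPY loop (assumes len(dx) >= 1)."""
    FR_SIZE, PR_SIZE = 96, 48      # rotating f32-f127 and p16-p63
    DX, DY, TMP = 0, 3, 6          # assumed offsets of dx[0], dy[0], tmp[0]
    fr = [0.0] * FR_SIZE
    pr = [0] * PR_SIZE
    rrb = 0

    def f(offset):                 # rotating FR offset -> physical index
        return (offset + rrb) % FR_SIZE

    def p(number):                 # predicate number -> physical index
        return (number - 16 + rrb) % PR_SIZE

    out = list(dy)
    lc, ec = len(dx) - 1, 4        # mov ar.lc / mov ar.ec
    pr[p(16)] = 1                  # mov pr.rot = 0x10000 sets only p16
    load_ptr = store_ptr = 0

    while True:
        if pr[p(16)]:              # stage 0: both loads
            fr[f(DX)] = dx[load_ptr]
            fr[f(DY)] = dy[load_ptr]
            load_ptr += 1
        if pr[p(18)]:              # stage 2: operands loaded two passes ago
            fr[f(TMP)] = da * fr[f(DX + 2)] + fr[f(DY + 2)]
        if pr[p(19)]:              # stage 3: result computed one pass ago
            out[store_ptr] = fr[f(TMP + 1)]
            store_ptr += 1

        # br.ctop (simplified rule)
        if lc > 0:
            lc, new_p63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_p63, taken = ec - 1, 0, True
        else:
            ec, new_p63, taken = ec - 1, 0, False
        pr[p(63)] = new_p63        # becomes p16 once RRB is decremented
        rrb -= 1
        if not taken:
            return out


result = daxpy_pipelined(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

The same loop body serves as prologue, kernel and epilogue; the rotating predicates switch each stage on and off as iterations enter and leave the pipeline.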
35
Loop Execution
Execution Sequence: (p16) ldx  (p16) ldy  (p18) fma  (p19) st
Initialization: LC = 3, EC = 4, RRB = 0
36
Loop Execution
LC = 2, EC = 4, RRB = -1
37
Loop Execution
LC = 1, EC = 4, RRB = -2
38
Loop Execution
LC = 0, EC = 4, RRB = -3
39
Loop Execution
LC = 0, EC = 3, RRB = -4
40
Loop Execution
LC = 0, EC = 2, RRB = -5
41
Loop Execution
LC = 0, EC = 1, RRB = -6
42
Loop Execution
LC = 0, EC = 0, RRB = -7
43
Pipelining Latency
  • Suppose we change the latencies
  • load latency of 6 clocks
  • fma latency of 4 clocks

44
Example New Pipeline
  • Each column represents 1 source iteration

load dx,dy
tmp = dy + da * dx
store dy
45
Updated Loop
        .rotf dx[7], dy[7], tmp[5]
        mov ar.lc = 3                  // iterations-1
        mov ar.ec = 11                 // stages
        mov pr.rot = 0x10000
looptop:
  (p16) ldfd dx[0] = [dxsp], 8
  (p16) ldfd dy[0] = [dysp], 8
  (p22) fma.d tmp[0] = da, dx[6], dy[6]
  (p26) stfd [dydp] = tmp[4], 8
        br.ctop looptop
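
The numbers that change between the two versions of the loop follow from the latencies alone. The helper below is a small sketch of that arithmetic; the function name and result keys are made up for illustration, and it assumes one new source iteration starts every cycle, which the stated machine assumptions allow.

```python
def pipeline_parameters(load_latency, fma_latency):
    """Derive stage numbers and rotating-array sizes for the DAXPY loop."""
    fma_stage = load_latency                # fma waits for the loads
    store_stage = fma_stage + fma_latency   # store waits for the fma
    return {
        "ec": store_stage + 1,              # number of pipeline stages
        "fma_predicate": 16 + fma_stage,    # stage 0 is p16
        "store_predicate": 16 + store_stage,
        "dx_dy_regs": load_latency + 1,     # .rotf dx[n], dy[n]
        "tmp_regs": fma_latency + 1,        # .rotf tmp[n]
        "fma_source_index": load_latency,   # dx[k], dy[k] read by the fma
        "store_source_index": fma_latency,  # tmp[k] read by the store
    }


short = pipeline_parameters(load_latency=2, fma_latency=1)
long = pipeline_parameters(load_latency=6, fma_latency=4)
```

Only these constants differ between the two loops; the instruction count of the loop body stays the same.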
46
Rotation Summary
  • Loop pipelining maximizes performance and minimizes
    overhead
    • Avoids code expansion of unrolling and code
      explosion of prologue and epilogue
    • Smaller code means fewer cache misses
    • Greater performance improvements in higher
      latency conditions
  • Reduced overhead allows S/W pipelining of small
    loops with unknown trip counts
    • Typical of integer scalar codes

47
Outline
  • Register Stack
  • Register Stack Engine
  • Register Rotation
  • Loop Branches
  • Modulo-Scheduling of Loops
  • Summary

48
Register Model Summary
  • GR Stack
    • Overlap call/ret operations with real work
    • RSE hides spills/fills
  • GR, FR, PR Rotation
    • General acceleration for all types of loops
  • SW-visible resources
    • Large named register files and renaming
    • HW simplicity and explicit control

49
IA-64 Register Model: Stack and Rotation
  • Dale Morris
  • Architect
  • Hewlett Packard Co.