Title: IA64 Register Model: Stack
1IA-64 Register Model: Stack and Rotation
- Dale Morris
- Architect
- Hewlett Packard Co.
2Philosophy
- Large register files
- Most processors have lots of registers
- Explicit control over register renaming
- Most processors have register renaming
- IA-64 makes the register names SW-visible and makes the renaming explicit
3Outline
- Register Stack
- Register Stack Engine
- Register Rotation
- Loop Branches
- Modulo-Scheduling of Loops
- Summary
4Register Stack
- Motivation
- Automatic save/restore of GRs on procedure call/return
- Cache traffic reduction
- Latency hiding of register spill/fill
5General Registers
6GR Stack Frame
[Figure: GR stack frame, labeled with size of frame (sof), size of locals (sol), and the Current Frame Marker (CFM)]
7GR Stack Frame - Example
8GR Stack Frame - Call
9GR Stack Frame - Allocate
inputs
10GR Stack Frame - Return
11Instructions
- br.call
- Copies CFM to PFM
- Creates new frame with only output regs
- Saves local regs from previous frame
- alloc
- Resizes current frame
- Saves PFM to a GR
12Instructions (cont.)
- mov to PFS
- Restores PFM from a GR
- br.ret
- Restores CFM from PFM
- Restores local regs for previous frame
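The frame bookkeeping performed by br.call, alloc, mov to PFS, and br.ret can be sketched as a small Python model. This is an illustrative toy rather than the architected behavior: the class and method names are hypothetical, PFM is modeled as just (sof, sol), and register contents, the rotating region, and the RSE are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMarker:
    sof: int = 0  # size of frame: inputs + locals + outputs
    sol: int = 0  # size of locals: inputs + locals


class RegisterStack:
    """Toy model: r32 of the current frame lives at stacked slot `base`."""

    def __init__(self):
        self.base = 0
        self.cfm = FrameMarker()  # Current Frame Marker
        self.pfm = FrameMarker()  # Previous Frame Marker

    def alloc(self, inputs, local_regs, outputs):
        """Resize the current frame; return PFM so the caller can keep it in a GR."""
        sol = inputs + local_regs
        self.cfm = FrameMarker(sof=sol + outputs, sol=sol)
        return self.pfm

    def br_call(self):
        """Copy CFM to PFM; the new frame contains only the caller's outputs."""
        self.pfm = self.cfm
        self.base += self.cfm.sol
        self.cfm = FrameMarker(sof=self.cfm.sof - self.cfm.sol, sol=0)

    def mov_to_pfs(self, saved):
        """Restore PFM from the value alloc handed back earlier."""
        self.pfm = saved

    def br_ret(self):
        """Restore CFM from PFM and step back to the caller's frame."""
        self.cfm = self.pfm
        self.base -= self.cfm.sol

    def slot(self, r):
        """Stacked slot behind virtual register r of the current frame."""
        if not 32 <= r < 32 + self.cfm.sof:
            raise ValueError(f"r{r} is outside the current frame")
        return self.base + (r - 32)
```

In this model the caller's first output register and the callee's r32 resolve to the same slot, which is how arguments pass without copying. A non-leaf procedure has to hold on to the value returned by alloc, because its own br_call overwrites PFM.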
13Leaf Procedure Optimization
- No need to save/restore PFM
- Can always use scratch static GRs
- Can omit alloc if
- Not many registers needed
- Register rotation not needed
14Register Stack Engine
- Automatically spills/fills registers to/from memory as needed
- Registers saved on a Backing Store Stack
- Spills/fills NaT bits as well
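A minimal Python sketch of the spill/fill idea follows. It is a simplification under stated assumptions: `ToyRSE` and its methods are hypothetical names, frames do not overlap (the real caller-output/callee-input overlap is ignored), spills happen eagerly at call time, and each register is modeled as a (value, NaT bit) pair.

```python
class ToyRSE:
    """Toy register stack engine: oldest registers spill to a backing store."""

    def __init__(self, num_physical):
        self.num_physical = num_physical
        self.frames = []         # (base, size) per active procedure, oldest first
        self.resident = {}       # logical slot -> (value, nat) held in registers
        self.backing_store = []  # spilled (value, nat) pairs; index == logical slot
        self.spills = 0
        self.fills = 0

    def call(self, size):
        """Open a new frame, spilling the oldest registers if space runs out."""
        if size > self.num_physical:
            raise ValueError("frame does not fit in the physical stacked registers")
        base = sum(frame_size for _, frame_size in self.frames)
        self.frames.append((base, size))
        for slot in range(base, base + size):
            self.resident[slot] = (0, False)
        while len(self.resident) > self.num_physical:
            oldest = len(self.backing_store)
            self.backing_store.append(self.resident.pop(oldest))
            self.spills += 1

    def ret(self):
        """Drop the current frame and fill the caller's registers if spilled."""
        base, size = self.frames.pop()
        for slot in range(base, base + size):
            del self.resident[slot]
        if self.frames:
            caller_base, _ = self.frames[-1]
            while len(self.backing_store) > caller_base:
                slot = len(self.backing_store) - 1
                self.resident[slot] = self.backing_store.pop()
                self.fills += 1

    def _slot(self, r):
        base, size = self.frames[-1]
        if not 32 <= r < 32 + size:
            raise ValueError(f"r{r} is outside the current frame")
        return base + (r - 32)

    def write(self, r, value, nat=False):
        self.resident[self._slot(r)] = (value, nat)

    def read(self, r):
        return self.resident[self._slot(r)]
```

With 8 physical registers and three nested 4-register frames, the third call spills the first frame, and the matching return fills it again. The procedures themselves never issue a load or store for those registers.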
15Reg Stack Backing Store
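[Figure: A calls B calls C. The physical stacked registers hold the current frame of procC (sofc) together with the locals of procB (solb) and procA (sola), with unallocated registers at either end. The Backing Store holds procA's ancestors; procA and procB also appear there. RSE loads/stores move registers between the physical stacked registers and the Backing Store; arrows mark the call and return directions.]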
16Register Stack Summary
- Exposes register renaming to SW
- Avoids register spill when few needed
- Hides register spill/fill
- Programmable sizes
- only use as many registers as you need
18Register Rotation
- Motivation
- pipeline-schedule loops onto HW
- remove extraneous work from loop
- minimize start-up overhead
- small code footprint
- maximum computational throughput with few
instructions
19GR Stack Frame w/ Rotation
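[Figure: GRs 0-31 are static. The stacked frame begins at GR 32 and can extend to GR 127; it contains the rotating region (size of rotating, sor), the locals (up to sol), and the outputs (up to sof). Current Frame Marker (CFM) fields: sof, sol, sor, rrb.gr, rrb.fr, rrb.pr.]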
20GR Rotation
- Size of rotating region is a multiple of 8
- Rotating region overlays current frame
- Starts at r32
- Overlay allows rotation and stack renaming in a single level of adders
- Must copy input registers before loop
21FR Rotation
- Upper 3/4 of register file rotates
- Registers 32-127 are rotating; registers 0-31 are static
22Predicate Rotation
- Upper 3/4 of register file rotates
- Registers 16-63 are rotating; registers 0-15 are static
23Register Rotation RRB
- Separate Rotating Register Base (RRB) for each of GRs, FRs, PRs
- Loop branches decrement all rotating register bases
- Instructions contain a virtual register number
- RRB + virtual register number = physical register number
[Figure, animated over slides 23-27: the words "Palm", "Springs", "is", "Sunny" are placed in rotating registers one at a time under the label IA-64. The register numbers shown run 36 down to 32 at RRB = 0, then 35 down to 32 followed by 127 at RRB = -1, then 34 down to 32 followed by 127 and 126 at RRB = -2, then 33, 32, 127, 126, 125 at RRB = -3.]
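The renaming rule can be sketched in a few lines of Python. The function name and its parameters are hypothetical. The wrap-around within the rotating region is spelled out explicitly as a modulo, and RRB is kept as a signed count to match the slide labels rather than any hardware encoding.

```python
def physical_register(virtual, rrb, first_rotating=32, rotating_size=96):
    """Map a virtual register number to a physical one under rotation.

    Defaults describe a region covering registers 32-127; pass
    first_rotating=16, rotating_size=48 for the predicate file.
    """
    if not first_rotating <= virtual < first_rotating + rotating_size:
        return virtual  # registers outside the rotating region are not renamed
    offset = (virtual - first_rotating + rrb) % rotating_size
    return first_rotating + offset
```

For example, a value written to virtual register 32 while RRB = 0 sits in physical register 32. After one loop branch (RRB = -1) the same physical register is reached as virtual register 33, and virtual register 32 now names physical register 127.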
28Loop Branches
- br.cloop uses LC for simple, non-pipelined loops
- decrements LC and loops until LC is 0
- br.ctop uses LC and EC for pipelined counted loops
- br.wtop uses branch predicate and EC for pipelined while loops
- br.cexit, br.wexit used for unrolled, pipelined loops
29br.ctop
- Function (simplified)
    if (LC > 0)      { LC--; pr63 = 1; rrb--; loop }
    else if (EC > 1) { EC--; pr63 = 0; rrb--; loop }
    else             { EC--; pr63 = 0; rrb--; fall_through }
- LC counts main loop iterations
- EC counts pipeline stages for drain
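The simplified br.ctop function can be transcribed into Python to follow LC, EC, and RRB step by step. `LoopState`, `br_ctop`, and `trace` are hypothetical names, and the model follows the simplified function on this slide, not the full architectural definition.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    lc: int        # loop count: remaining main-loop iterations
    ec: int        # epilogue count: pipeline stages left to drain
    rrb: int = 0   # rotating register base, as a signed count
    pr63: int = 0  # predicate written just before rotation (becomes p16)


def br_ctop(state):
    """Apply one simplified br.ctop; return True if the branch loops."""
    if state.lc > 0:
        state.lc -= 1
        state.pr63 = 1
        taken = True
    elif state.ec > 1:
        state.ec -= 1
        state.pr63 = 0
        taken = True
    else:
        state.ec -= 1
        state.pr63 = 0
        taken = False
    state.rrb -= 1
    return taken


def trace(lc, ec):
    """List (LC, EC, RRB) at initialization and after each br.ctop."""
    state = LoopState(lc, ec)
    history = [(state.lc, state.ec, state.rrb)]
    while True:
        taken = br_ctop(state)
        history.append((state.lc, state.ec, state.rrb))
        if not taken:
            return history
```

Calling `trace(3, 4)` reproduces the LC/EC/RRB sequence of the Loop Execution slides: three counted branches take LC from 3 to 0, then four more drain EC from 4 to 0, ending at RRB = -7.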
30Software Pipelining
- Overlapping execution of different loop iterations
- More iterations in same amount of time
[Figure: non-overlapped vs. overlapped execution of loop iterations]
31Software Pipelining
- Traditional architectures use loop unrolling
- High overhead: extra code for loop body, prologue, and epilogue
- Synergistic use of IA-64 features
- Full Predication
- Special branches
- Register rotation removes loop copy overhead
- Predicate rotation removes prologue and epilogue
- Especially useful for integer code with small number of loop iterations
32Pipelined Loop Example
- DAXPY inner loop
- dy[i] = dy[i] + (da * dx[i])
- 2 loads, 1 fma, 1 store / iteration
- Machine assumptions
- can do 2 loads, 1 store, 1 fma, 1 br / cycle
- load latency of 2 clocks
- fma latency of 1 clock
33Example Pipeline
- Each column represents 1 source iteration
[Figure: pipeline diagram with rows "load dx,dy", "tmp = dy + da * dx", "store dy"]
34Example Code
        .rotf dx[3], dy[3], tmp[2]
        mov   ar.lc = 3               // iterations-1
        mov   ar.ec = 4               // stages
        mov   pr.rot = 0x10000
looptop:
  (p16) ldfd  dx[0] = [dxsp], 8
  (p16) ldfd  dy[0] = [dysp], 8
  (p18) fma.d tmp[0] = da, dx[2], dy[2]
  (p19) stfd  [dydp] = tmp[1], 8
        br.ctop looptop
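To see why a single copy of the loop body is enough, the example loop can be simulated in Python with rotating predicates and rotating FRs. This is a sketch under assumptions: the function names are hypothetical, the rotating FR names are assumed to be laid out consecutively from f32 (dx, then dy, then tmp), one loop-body execution is counted as one cycle, and br.ctop follows the simplified function shown earlier in the deck.

```python
def rotate(virtual, rrb, first, size):
    """Physical number of a rotating register, wrapped into its region."""
    return first + (virtual - first + rrb) % size


def pipelined_daxpy(da, dx, dy):
    """Run the 4-stage pipelined loop; return (new dy, loop-body executions)."""
    n = len(dx)
    out = list(dy)
    if n == 0:
        return out, 0
    DX, DY, TMP = 32, 35, 38      # assumed layout for .rotf dx[3], dy[3], tmp[2]
    fr = {}                       # physical FR contents
    pr = [0] * 64                 # physical predicate contents
    pr[16] = 1                    # mov pr.rot = 0x10000
    lc, ec, rrb = n - 1, 4, 0     # ar.lc = iterations-1, ar.ec = stages
    load_index = store_index = cycles = 0

    def f(virtual):               # physical FR number for a virtual FR
        return rotate(virtual, rrb, 32, 96)

    def p(virtual):               # current value of a virtual predicate
        return pr[rotate(virtual, rrb, 16, 48)]

    while True:
        cycles += 1
        if p(16):                 # (p16) ldfd dx[0], dy[0]
            fr[f(DX)] = dx[load_index]
            fr[f(DY)] = dy[load_index]
            load_index += 1
        if p(18):                 # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[f(TMP)] = da * fr[f(DX + 2)] + fr[f(DY + 2)]
        if p(19):                 # (p19) stfd [dydp] = tmp[1]
            out[store_index] = fr[f(TMP + 1)]
            store_index += 1
        # br.ctop: write pr63, rotate, then loop or fall through
        if lc > 0:
            lc, new_pr63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_pr63, taken = ec - 1, 0, True
        else:
            ec, new_pr63, taken = ec - 1, 0, False
        pr[rotate(63, rrb, 16, 48)] = new_pr63
        rrb -= 1
        if not taken:
            return out, cycles
```

For four elements the body runs seven times: four times while new iterations are being started and three more to drain the pipeline. The predicates turn the fill and drain phases on and off, so no separate prologue or epilogue code is needed.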
35Loop Execution
- Execution sequence: (p16) ldx, (p16) ldy, (p18) fma, (p19) st
- Each step of the animation (slides 35-42) tracks predicates p16, p18, p19, and p63 along with LC, EC, and RRB:
  Slide  LC  EC  RRB
  35      3   4    0   (initialization)
  36      2   4   -1
  37      1   4   -2
  38      0   4   -3
  39      0   3   -4
  40      0   2   -5
  41      0   1   -6
  42      0   0   -7
43Pipelining Latency
- Suppose we change the latencies
- load latency of 6 clocks
- fma latency of 4 clocks
44Example New Pipeline
- Each column represents 1 source iteration
[Figure: pipeline diagram with rows "load dx,dy", "tmp = dy + da * dx", "store dy"]
45Updated Loop
        .rotf dx[7], dy[7], tmp[5]
        mov   ar.lc = 3               // iterations-1
        mov   ar.ec = 11              // stages
        mov   pr.rot = 0x10000
looptop:
  (p16) ldfd  dx[0] = [dxsp], 8
  (p16) ldfd  dy[0] = [dysp], 8
  (p22) fma.d tmp[0] = da, dx[6], dy[6]
  (p26) stfd  [dydp] = tmp[4], 8
        br.ctop looptop
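The predicate numbers, register offsets, and EC value in both versions of the loop follow directly from the two latencies. The sketch below, with a hypothetical `stage_plan` helper, assumes one loop-body execution per cycle (which the stated machine resources allow) and that each operation is scheduled as soon as its input is ready.

```python
def stage_plan(load_latency, fma_latency, first_predicate=16):
    """Derive stage predicates, rotating-register sizes, and EC from latencies."""
    fma_stage = load_latency                  # fma waits for the loads
    store_stage = load_latency + fma_latency  # store waits for the fma
    return {
        "load_predicate": first_predicate,
        "fma_predicate": first_predicate + fma_stage,
        "store_predicate": first_predicate + store_stage,
        "ec": store_stage + 1,                # number of pipeline stages
        "rotf": {
            "dx": load_latency + 1,
            "dy": load_latency + 1,
            "tmp": fma_latency + 1,
        },
        "fma_reads": (f"dx[{load_latency}]", f"dy[{load_latency}]"),
        "store_reads": f"tmp[{fma_latency}]",
    }
```

With latencies of 2 and 1 this gives p16/p18/p19 and EC = 4; with 6 and 4 it gives p16/p22/p26 and EC = 11. Only the numbers in the loop change, while the instruction count stays the same.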
46Rotation Summary
- Loop pipelining maximizes performance, minimizes overhead
- Avoids code expansion of unrolling and code explosion of prologue and epilogue
- Smaller code means fewer cache misses
- Greater performance improvements in higher latency conditions
- Reduced overhead allows S/W pipelining of small loops with unknown trip counts
- Typical of integer scalar codes
48Register Model Summary
- GR Stack
- Overlap call/ret operations with real work
- RSE hides spills/fills
- GR, FR, PR Rotation
- General acceleration for all types of loops
- SW-visible resources
- Large named register files and renaming
- HW simplicity and explicit control