IA64 Register Model: Stack - PowerPoint PPT Presentation

Provided by: jwax4

1
IA-64 Register Model: Stack & Rotation

Dale Morris
Architect
Hewlett Packard Co.

2
Philosophy

Large register files
Most processors have lots of registers
Explicit control over register renaming
Most processors have register renaming
IA-64 makes the register names SW-visible, which makes the renaming explicit

3
Outline

Register Stack
Register Stack Engine
Register Rotation
Loop Branches
Modulo-Scheduling of Loops
Summary

4
Register Stack

Motivation
Automatic save/restore of GRs on procedure call/return
Cache traffic reduction
Latency hiding of register spill/fill

5
General Registers
6
GR Stack Frame
[Figure labels: size of frame (sof), size of locals (sol), Current Frame Marker (CFM)]
7
GR Stack Frame - Example
8
GR Stack Frame - Call
9
GR Stack Frame - Allocate
[Figure label: inputs]
10
GR Stack Frame - Return
11
Instructions

br.call
Copies CFM to PFM
Creates new frame with only output regs
Saves local regs from previous frame
alloc
Resizes current frame
Saves PFM to a GR

12
Instructions (cont.)

mov to PFS
Restores PFM from a GR
br.ret
Restores CFM from PFM
Restores local regs for previous frame
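
A minimal sketch, in Python, of how the frame markers move across br.call, alloc, mov to PFS, and br.ret as listed on the two Instructions slides. The names RegisterStack, Frame, and bof (the physical index that r32 currently maps to) are illustrative and not from the slides. Physical wrap-around, the RSE, and rotation are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    sof: int = 0  # size of frame
    sol: int = 0  # size of locals (inputs + locals)


class RegisterStack:
    def __init__(self):
        self.cfm = Frame()  # Current Frame Marker
        self.pfm = Frame()  # previous frame marker
        self.bof = 0        # hypothetical name: where r32 of the current frame starts

    def br_call(self):
        # Copy CFM to PFM; the callee's new frame holds only the caller's outputs.
        self.pfm = self.cfm
        self.bof += self.cfm.sol
        self.cfm = Frame(sof=self.cfm.sof - self.cfm.sol, sol=0)

    def alloc(self, inputs, local_regs, outputs):
        # Resize the current frame and hand back PFM so it can be kept in a GR.
        self.cfm = Frame(sof=inputs + local_regs + outputs,
                         sol=inputs + local_regs)
        return self.pfm

    def mov_to_pfs(self, saved):
        # Restore PFM from the value saved by alloc.
        self.pfm = saved

    def br_ret(self):
        # Restore CFM from PFM; the caller's locals become visible again.
        self.cfm = self.pfm
        self.bof -= self.pfm.sol
```

A non-leaf procedure keeps the value returned by alloc and passes it to mov_to_pfs before its own br_ret, because any call it makes overwrites PFM.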

13
Leaf Procedure Optimization

No need to save/restore PFM
Can always use scratch static GRs
Can omit alloc if
Not many registers needed
Register rotation not needed

14
Register Stack Engine (RSE)

Automatically spills registers to memory and fills them from memory as needed
Registers saved on a Backing Store Stack
Spills/fills NaT bits as well

15
Reg Stack Backing Store
[Figure: physical stacked registers shown next to the backing store for the call chain A calls B calls C. Labels: current frame (procC, sofc); procB (solb); procA (sola); procA's ancestors; unallocated; RSE loads/stores; call; return.]
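
A simplified sketch, in Python, of the spill/fill behaviour described for the backing store. It assumes the oldest resident register is spilled only when a new frame needs the space, and that fills happen on return. Real RSE policy is implementation-specific, and NaT bits are not modelled. The class and method names are hypothetical.

```python
from collections import deque


class BackingStoreModel:
    def __init__(self, physical_regs):
        self.physical_regs = physical_regs  # number of physical stacked registers
        self.resident = deque()             # register values held in hardware, oldest first
        self.backing_store = []             # spilled values in memory, oldest first

    def grow(self, values):
        """Allocate registers for a new frame, spilling the oldest ones if full."""
        for value in values:
            if len(self.resident) == self.physical_regs:
                self.backing_store.append(self.resident.popleft())
            self.resident.append(value)

    def shrink(self, dropped, needed):
        """Return from a frame: drop its registers, then fill the caller's."""
        for _ in range(dropped):
            self.resident.pop()
        while len(self.resident) < needed and self.backing_store:
            self.resident.appendleft(self.backing_store.pop())
```

With eight physical registers, two five-register frames force two spills on the call, and the return fills them back in their original order.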
16
Register Stack Summary

Exposes register renaming to SW
Avoids register spill when few needed
Hides register spill/fill
Programmable sizes
only use as many registers as you need

17
Outline

Register Stack
Register Stack Engine
Register Rotation
Loop Branches
Modulo-Scheduling of Loops
Summary

18
Register Rotation

Motivation
pipeline-schedule loops onto HW
remove extraneous work from loop
minimize start-up overhead
small code footprint
maximum computational throughput with few instructions

19
GR Stack Frame w/ Rotation
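[Figure: general register file, r0 to r127. r0-r31 are static. The frame starts at r32: locals extend up to sol, outputs up to sof, and the rotating region (Size of Rotating, sor) also starts at r32. Current Frame Marker (CFM) fields: sof, sol, sor, rrb.gr, rrb.fr, rrb.pr.]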
20
GR Rotation

Size of rotating region is a multiple of 8
Rotating region overlays current frame
Starts at r32
Overlay allows rotation and stack renaming in a single level of adders
Must copy input registers before loop
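
A sketch, in Python, of how rotation and stack renaming can compose for a general register. The function name, the bof parameter (physical index of r32 for the current frame), and the count of 96 physical stacked registers are assumptions made for illustration; the slide's point about a single level of adders concerns the hardware and is not modelled here.

```python
N_STACKED = 96  # assumed number of physical stacked registers


def physical_gr(r, bof, rot_size, rrb_gr):
    """Map virtual GR number r to ("static", n) or ("stacked", physical index).

    rot_size is the number of rotating registers (a multiple of 8),
    rrb_gr is the rotating register base for GRs (0, -1, -2, ...).
    """
    if r < 32:
        return ("static", r)            # r0-r31 are never renamed
    offset = r - 32                     # position inside the current frame
    if offset < rot_size:
        offset = (offset + rrb_gr) % rot_size   # rotation, wraps inside the region
    return ("stacked", (bof + offset) % N_STACKED)  # stack renaming
```

For example, with an 8-register rotating region and rrb_gr = -1, virtual r33 lands where r32 was before the rotation, while r40 (outside the rotating region) is only shifted by the frame base.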

21
FR Rotation
Upper 3/4 of register file rotates
Rotating: f32-f127
Static: f0-f31
22
Predicate Rotation
Upper 3/4 of register file rotates
Rotating: p16-p63
Static: p0-p15
23
Register Rotation RRB

Separate Rotating Register Base (RRB) for each of the GRs, FRs, and PRs
Loop branches decrement all rotating register bases (RRBs)
Instructions contain a virtual register number
RRB + virtual register number = physical register number (wrapping within the rotating region)

[Figure: RRB = 0. Register labels shown: 36, 35, 34, 33, 32. Word shown beside the registers: Palm. Legend: Palm, Sunny, is, Springs.]
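
A small Python sketch of the RRB mapping for a register file whose upper part rotates (the FR layout, f32-f127, is assumed here). The RotatingFile class is hypothetical, and the words from the figure are used only as sample values; this is not a reconstruction of the animation.

```python
class RotatingFile:
    def __init__(self, total=128, first_rotating=32):
        self.regs = [None] * total
        self.first = first_rotating
        self.size = total - first_rotating  # number of rotating registers
        self.rrb = 0                        # rotating register base

    def physical(self, virtual):
        """RRB + virtual number, wrapped inside the rotating region."""
        if virtual < self.first:
            return virtual                  # static registers are not renamed
        return self.first + (virtual - self.first + self.rrb) % self.size

    def read(self, virtual):
        return self.regs[self.physical(virtual)]

    def write(self, virtual, value):
        self.regs[self.physical(virtual)] = value

    def rotate(self):
        self.rrb -= 1                       # what a loop branch does to each RRB


rf = RotatingFile()
words = ["Palm", "Springs", "is", "Sunny"]
for i, word in enumerate(words):
    rf.write(32, word)                      # every step writes the same virtual name
    if i < len(words) - 1:
        rf.rotate()
sentence = [rf.read(v) for v in (35, 34, 33, 32)]
```

Each rotation makes a value written to virtual register 32 reappear one name higher, so after three rotations the four words are readable from registers 35 down to 32 without any copy instructions.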
24
Register Rotation RRB

[Figure, animation step: RRB = 0. Register labels shown: 36, 35, 34, 33, 32. Words shown beside the registers: Palm, Springs. Text built so far: "IA-64 Palm".]
25
Register Rotation RRB

[Figure, animation step: RRB = -1. Register labels shown: 35, 34, 33, 32, 127. Words shown beside the registers: Palm, Springs, is. Text built so far: "IA-64 Palm Springs".]
26
Register Rotation RRB

[Figure, animation step: RRB = -2. Register labels shown: 34, 33, 32, 127, 126. Words shown beside the registers: Springs, is, Sunny. Text built so far: "IA-64 Palm Springs is".]
27
Register Rotation RRB

[Figure, animation step: RRB = -3. Register labels shown: 33, 32, 127, 126, 125. Words shown beside the registers: is, Sunny. Text built so far: "IA-64 Palm Springs is Sunny".]
28
Loop Branches

br.cloop uses LC for simple, non-pipelined loops
decrements LC and loops until LC is 0
br.ctop uses LC and EC for pipelined counted loops
br.wtop uses branch predicate and EC for pipelined while loops
br.cexit, br.wexit used for unrolled, pipelined loops

29
br.ctop

Function (simplified)
if (LC > 0)      { LC--; pr63 = 1; rrb--; loop }
else if (EC > 1) { EC--; pr63 = 0; rrb--; loop }
else             { EC--; pr63 = 0; rrb--; fall_through }
LC counts main loop iterations
EC counts pipeline stages for drain
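
The simplified br.ctop function above can be traced with a few lines of Python. This is a sketch of the slide's simplified version only (the function names are illustrative); it records LC, EC, and RRB after each execution of the branch.

```python
def br_ctop(lc, ec, rrb):
    """One execution of the simplified br.ctop.

    Returns (lc, ec, rrb, pr63, taken).
    """
    if lc > 0:
        return lc - 1, ec, rrb - 1, 1, True     # main iterations: start a new one
    if ec > 1:
        return lc, ec - 1, rrb - 1, 0, True     # drain: keep looping, start nothing
    return lc, ec - 1, rrb - 1, 0, False        # last drain stage: fall through


def trace(lc, ec):
    """List of (LC, EC, RRB) states, starting with the initial one."""
    rrb = 0
    states = [(lc, ec, rrb)]
    taken = True
    while taken:
        lc, ec, rrb, _pr63, taken = br_ctop(lc, ec, rrb)
        states.append((lc, ec, rrb))
    return states
```

With LC = 3 and EC = 4 the branch executes seven times, which is the number of source iterations plus the number of stages minus one.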

30
Software Pipelining

Overlapping execution of different loop iterations

vs.

More iterations in same amount of time

31
Software Pipelining

Traditional architectures use loop unrolling
High overhead: extra code for loop body, prologue, and epilogue
Synergistic use of IA-64 features
Full predication
Special branches
Register rotation removes loop copy overhead
Predicate rotation removes prologue and epilogue

Especially useful for integer code with a small number of loop iterations
32
Pipelined Loop Example

DAXPY inner loop
dy[i] = dy[i] + (da * dx[i])
2 loads, 1 fma, 1 store / iteration
Machine assumptions
can do 2 loads, 1 store, 1 fma, 1 br / cycle
load latency of 2 clocks
fma latency of 1 clock

33
Example Pipeline

Each column represents 1 source iteration

load dx,dy
tmp = dy + da * dx
store dy
34
Example Code
        .rotf dx[3], dy[3], tmp[2]
        mov ar.lc = 3              // iterations-1
        mov ar.ec = 4              // stages
        mov pr.rot = 0x10000
looptop:
(p16)   ldfd dx[0] = [dxsp], 8
(p16)   ldfd dy[0] = [dysp], 8
(p18)   fma.d tmp[0] = da, dx[2], dy[2]
(p19)   stfd [dydp] = tmp[1], 8
        br.ctop looptop
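
A functional sketch, in Python, of how the loop on this slide executes. It models rotating FRs and predicates plus the simplified br.ctop, and ignores cycle timing. The Rotating class and daxpy_pipelined function are hypothetical, and the placement of dx, dy, and tmp at consecutive rotating offsets is an assumption about what .rotf does.

```python
class Rotating:
    """Rotating register region; virtual index 0 is the first rotating register."""

    def __init__(self, size):
        self.regs = [0] * size
        self.rrb = 0  # rotating register base, decremented by the loop branch

    def _phys(self, virtual):
        return (virtual + self.rrb) % len(self.regs)

    def __getitem__(self, virtual):
        return self.regs[self._phys(virtual)]

    def __setitem__(self, virtual, value):
        self.regs[self._phys(virtual)] = value


def daxpy_pipelined(da, dx, dy):
    """Return (dy + da*dx, loop trips) for the 4-stage schedule; needs len(dx) >= 1."""
    fr = Rotating(96)               # f32-f127
    pr = Rotating(48)               # p16-p63
    DX, DY, TMP = 0, 3, 6           # assumed layout: dx[3], dy[3], tmp[2]
    P16, P18, P19, P63 = 0, 2, 3, 47
    result = list(dy)
    pr[P16] = 1                     # mov pr.rot = 0x10000
    lc, ec = len(dx) - 1, 4         # ar.lc = iterations-1, ar.ec = stages
    load_i = store_i = trips = 0
    while True:
        trips += 1
        if pr[P16]:                 # (p16) ldfd dx[0], dy[0]
            fr[DX + 0] = dx[load_i]
            fr[DY + 0] = result[load_i]
            load_i += 1
        if pr[P18]:                 # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[TMP + 0] = da * fr[DX + 2] + fr[DY + 2]
        if pr[P19]:                 # (p19) stfd [dydp] = tmp[1]
            result[store_i] = fr[TMP + 1]
            store_i += 1
        # br.ctop (simplified): set p63, rotate FRs and PRs, decide whether to loop
        if lc > 0:
            lc -= 1
            pr[P63] = 1
            taken = True
        else:
            taken = ec > 1
            ec -= 1
            pr[P63] = 0
        fr.rrb -= 1
        pr.rrb -= 1
        if not taken:
            return result, trips
```

The value written to p63 becomes p16 after the rotation, so each set predicate walks through p16, p17, p18, p19 and switches on one stage per trip. That is what lets the same four instructions serve as prologue, kernel, and epilogue.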
35
Loop Execution
Execution sequence: (p16) ldx, (p16) ldy, (p18) fma, (p19) st
Predicates shown: p19, p18, p16, p63
Initialization: LC = 3, EC = 4, RRB = 0
36
Loop Execution
LC = 2, EC = 4, RRB = -1
37
Loop Execution
LC = 1, EC = 4, RRB = -2
38
Loop Execution
LC = 0, EC = 4, RRB = -3
39
Loop Execution
LC = 0, EC = 3, RRB = -4
40
Loop Execution
LC = 0, EC = 2, RRB = -5
41
Loop Execution
LC = 0, EC = 1, RRB = -6
42
Loop Execution
LC = 0, EC = 0, RRB = -7
43
Pipelining Latency

Suppose we change the latencies
load latency of 6 clocks
fma latency of 4 clocks

44
Example New Pipeline

Each column represents 1 source iteration

load dx,dy
tmp = dy + da * dx
store dy
45
Updated Loop
        .rotf dx[7], dy[7], tmp[5]
        mov ar.lc = 3              // iterations-1
        mov ar.ec = 11             // stages
        mov pr.rot = 0x10000
looptop:
(p16)   ldfd dx[0] = [dxsp], 8
(p16)   ldfd dy[0] = [dysp], 8
(p22)   fma.d tmp[0] = da, dx[6], dy[6]
(p26)   stfd [dydp] = tmp[4], 8
        br.ctop looptop
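
The numbers that change between the two versions of the loop follow from the latencies. The Python sketch below assumes one new iteration starts every cycle and that each instruction is placed as early as its operand latency allows; the function name and dictionary keys are illustrative.

```python
def pipeline_params(load_latency, fma_latency):
    """Derive stage predicates, EC, and .rotf sizes for the DAXPY loop."""
    load_stage = 0
    fma_stage = load_stage + load_latency     # fma waits for the loads
    store_stage = fma_stage + fma_latency     # store waits for the fma
    return {
        "load_pred": 16 + load_stage,
        "fma_pred": 16 + fma_stage,
        "store_pred": 16 + store_stage,
        "ec": store_stage + 1,                # number of pipeline stages
        # a value must stay addressable from the stage that writes it
        # to the stage that reads it, hence latency + 1 rotating registers
        "rotf": {
            "dx": load_latency + 1,
            "dy": load_latency + 1,
            "tmp": fma_latency + 1,
        },
    }
```

With latencies of 2 and 1 this gives p16, p18, p19 and EC = 4. With latencies of 6 and 4 it gives p16, p22, p26 and EC = 11. Only the register counts, predicate numbers, and EC change; the loop body keeps the same five instructions.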
46
Rotation Summary