Title: A Programmable Memory Hierarchy for Prefetching Linked Data Structures
1. A Programmable Memory Hierarchy for Prefetching Linked Data Structures
- Chia-Lin Yang, Department of Computer Science and Information Engineering, National Taiwan University
- Alvin R. Lebeck, Department of Computer Science, Duke University
2. Memory Wall
- The processor-memory gap grows over time
- Prefetching
  - What? Future address prediction
  - When? Prefetch schedule
[Figure: the processor-memory gap, with CPU performance improving ~60%/year and DRAM performance ~10%/year]
3. Prefetch Linked Data Structures (LDS)
    p = head;
    while (p) {
        work(p->data);
        p = p->next;
    }
- Linked data structures
  - No regularity in the address stream
  - Adjacent elements are not necessarily contiguous in memory
- The pointer-chasing problem
[Figure: a linked list in which node p is currently being visited and a node several links ahead is the one we would like to prefetch]
    while (p) {
        prefetch(p->next->next->next);
        work(p->data);
        p = p->next;
    }
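A minimal C sketch of the two loops above (the node layout and the work() helper are assumptions added for illustration). It makes the pointer-chasing problem concrete: forming the prefetch address p->next->next->next requires dereferencing two intermediate nodes that are themselves likely cache misses, so the prefetch cannot run ahead of the traversal.

    /* Sketch only: node layout and work() are illustrative, not from the slides. */
    typedef struct node {
        int          data;
        struct node *next;
    } node_t;

    void work(int data);    /* application work on each node (assumed) */

    void traverse(node_t *head) {
        for (node_t *p = head; p != NULL; p = p->next)
            work(p->data);                   /* each p->next load may miss      */
    }

    void traverse_prefetch(node_t *head) {
        for (node_t *p = head; p != NULL; p = p->next) {
            if (p->next && p->next->next)                /* two dependent loads  */
                __builtin_prefetch(p->next->next->next); /* just to compute the  */
            work(p->data);                               /* prefetch address     */
        }
    }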
4. The Push Architecture
- An LDS prefetching framework built on a novel data movement model: push (Yang 2000)
[Figure: the traditional pull model vs. the new push model]
5. Outline
- Background and Motivation
- What is the Push Architecture?
- Design of the Push Architecture
- Variations of the Push Architecture
- Experimental Results
- Related Research
- Conclusion
6. Block Diagram of the Push Architecture
[Figure: a prefetch engine at each level of the memory hierarchy (L1 with a prefetch buffer, L2, and main memory), each issuing prefetch requests across the L2 and memory buses]
7. How to Predict Future Addresses?
- LDS traversal kernels
  - Load instructions in LDS traversal kernels are a compact representation of LDS accesses (Roth98)
- PFEs execute LDS traversal kernels independently of the CPU
- The amount of computation between node accesses affects how far the PFE can run ahead of the CPU
    while (list != NULL) {
        p = list->x;
        process(p->data);
        list = list->next;   /* recurrent load */
    }
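As a rough illustration of what a PFE (for example at the L2 or memory level) executes, here is a hedged C sketch of the same traversal kernel recast for the PFE; push_block_to_upper_level() is a hypothetical stand-in for the hardware push operation, not an interface defined in the talk.

    /* Hypothetical sketch: the PFE walks the list via the recurrent load and
     * pushes each touched block toward L1 instead of waiting to be asked.    */
    struct data_item;                                  /* payload type (assumed) */
    void push_block_to_upper_level(const void *addr);  /* hypothetical push     */

    typedef struct list_node {
        struct data_item *x;
        struct list_node *next;
    } list_node_t;

    void pfe_kernel(list_node_t *list) {
        while (list != NULL) {
            push_block_to_upper_level(list);      /* push the node itself        */
            push_block_to_upper_level(list->x);   /* push the data it points to  */
            list = list->next;                    /* recurrent load              */
        }
    }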
8. The Pointer-Chasing Problem: how does the push model help?
- Push model: a pipelined process
[Figure: the memory-level PFE traverses successive nodes (1, 2, 3, 4) and pushes each one up through L2 to L1, so the node accesses overlap in a pipeline]
9. Push Architecture Design Issues
1. PFE architecture design
2. Interaction among PFEs
3. Synchronization between the CPU and PFEs
4. Redundant prefetches
5. Demands on the cache/memory controller
[Figure: CPU, L1, L2, and main memory, each with a PFE attached to its cache/memory controller]
10. Issue 1: PFE Architecture
- Programmable PFE
  - General-purpose processor core
  - 5-stage pipeline, in-order processor
  - Integer ALU units for address calculation and control flow
  - TLB for address translation
  - Root register to store the root address of the LDS being traversed
11. Issue 2: Interaction among PFEs
Traversal kernel:
    Tree(root);
    Tree(node) {
        if (node) {
            Tree(node->left);
            Tree(node->right);
        }
    }
[Figure: the CPU and the PFEs at L1, L2, and memory each hold a root register; "resume" arrows show the traversal being handed from one PFE to the next, with the root register supplying the node at which to continue]
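A hedged C sketch of how each PFE might run the recursive kernel above; read_root_register() and push_block_to_upper_level() are hypothetical accessors standing in for the root register and the hardware push, and the hand-off itself is only modeled, not implemented.

    typedef struct tree_node {
        struct tree_node *left, *right;
    } tree_node_t;

    tree_node_t *read_root_register(void);              /* hypothetical */
    void push_block_to_upper_level(const void *addr);   /* hypothetical */

    static void tree_kernel(tree_node_t *node) {
        if (node) {
            push_block_to_upper_level(node);   /* push this node upward */
            tree_kernel(node->left);
            tree_kernel(node->right);
        }
    }

    void pfe_resume(void) {
        /* The CPU, or the PFE one level up, writes the node at which to
         * continue into this PFE's root register before resuming it.    */
        tree_kernel(read_root_register());
    }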
12. Issue 3: Synchronization between the CPU and PFEs
- When do we need to synchronize CPU and PFE execution?
  - Early prefetches: the PFEs are running too far ahead of the CPU
  - Useless prefetches: the PFEs are traversing down the wrong path, or are running behind the CPU
- Throttle mechanism (a software analogy is sketched below)
[Figure: the prefetch buffer keeps a free bit per cache block; the PFE produces into free entries and the CPU consumes them]
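A software analogy of the free-bit throttle, as a sketch only (the real mechanism is hardware in the prefetch buffer; the buffer size and helper names below are assumptions): the PFE may only write into an entry whose free bit is set, and the CPU sets the bit back when it consumes the block, so a PFE that runs too far ahead simply stalls.

    /* Software analogy of the hardware free-bit throttle; illustrative only.
     * In the analogy, the PFE and the CPU are separate agents.              */
    #define PB_ENTRIES 32

    struct pb_entry {
        int   free_bit;           /* 1: entry may be overwritten by the PFE */
        void *block;              /* prefetched cache block                 */
    };

    static struct pb_entry prefetch_buffer[PB_ENTRIES];

    /* PFE side: wait until the entry is free, then produce into it. */
    void pfe_produce(int i, void *block) {
        while (!prefetch_buffer[i].free_bit)
            ;                                  /* PFE suspends: no free entry */
        prefetch_buffer[i].block    = block;
        prefetch_buffer[i].free_bit = 0;
    }

    /* CPU side: consuming a block frees the entry and un-stalls the PFE. */
    void *cpu_consume(int i) {
        void *block = prefetch_buffer[i].block;
        prefetch_buffer[i].free_bit = 1;
        return block;
    }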
13. Variations of the Push Architecture
[Figure: three configurations, 3_PFE, 2_PFE, and 1_PFE, with three, two, and one prefetch engines respectively, each pushing data up the hierarchy]
- 2_PFE should perform comparably to 3_PFE
- 1_PFE performs well if most of the LDS exists only in main memory
14. Outline
- Background and Motivation
- What is the Push Architecture?
- Design of the Push Architecture
- Variations of the Push Architecture
- Experimental Results
- Related Research
- Conclusion
15. Experimental Setup
- SimpleScalar out-of-order processor
- Benchmarks
  - Olden benchmark suite, rayshade
- Baseline processor
  - 4-way issue, 64-entry RUU, 16-entry LSQ
  - Lockup-free caches with 8 outstanding misses
  - 32KB, 32B-line, 2-way L1; 512KB, 64B-line, 4-way L2
  - 84-cycle round-trip memory latency; 48-cycle DRAM access time
- Prefetch models
  - Push model: PFEs at 3 levels, 32-entry fully-associative prefetch buffer
  - Pull model: L1-level PFE, 32-entry fully-associative prefetch buffer
16. Performance Comparison: Push vs. Pull
- health, mst, perimeter, and treeadd
  - Push: 4% to 25% speedup; Pull: 0% to 4% speedup
- em3d, rayshade
  - Push: 31% to 57% speedup; Pull: 25% to 39% speedup
- bh
  - Push: 33% speedup; Pull: 33% speedup
- Dynamically changing structures: bisort and tsp
17. Variations of the Push Architecture
- 2_PFE performs comparably to 3_PFE
- 1_PFE performs comparably to 3_PFE except for em3d
18. Related Work
- Prefetching for irregular applications
  - Correlation-based prefetch (Joseph97, Alexander96)
  - Compiler-based prefetch (Luk96)
  - Dependence-based prefetch (Roth98)
  - Jump-pointer prefetch (Roth99)
- Decoupled architectures
  - Decoupled access/execute (Smith82)
  - Pre-execution (Annavaram2001, Collins2001, Roth2001, Zilles2001, Luk2001)
- Processor-in-memory
  - Berkeley IRAM group (Patterson97)
  - Active Pages (Oskin98)
  - FlexRAM (Kang99)
  - Impulse (Carter99)
  - Memory-side prefetching (Hughes2000)
19. Conclusion
- We built a general architectural solution for the push model
- The push model is effective in reducing the impact of the pointer-chasing problem on prefetching performance
  - Applications with tight traversal loops: Push 4% to 25%, Pull 0% to 4%
  - Applications with longer computation between node accesses: Push 31% to 57%, Pull 25% to 39%
- 2_PFE performs comparably to 3_PFE
20. Traversal Kernel
CPU code:
    void *HashLookup(int key, Hash hash) {
        j = (hash->mapfunc)(key);
        for (ent = hash->array[j]; ent && ent->key != key; ent = ent->next);
        if (ent) return ent->entry;
        return NULL;
    }

Passed to the PFE through the memory-mapped interface:
- traversal kernel identifier
- hash->array[j]
- key

PFE traversal kernel:
    void kernel(HashEntry ent, int key) {
        for (; ent && ent->key != key; ent = ent->next);
    }
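A hedged sketch of how the CPU could hand these three values to the PFE through memory-mapped registers; the register addresses, names, and the convention that writing the kernel id starts the PFE are all assumptions for illustration, not the encoding used in the talk.

    /* Illustrative only: register addresses and layout are invented. */
    #include <stdint.h>

    #define PFE_KERNEL_ID_REG ((volatile uint32_t  *)0xFFFF0000u)
    #define PFE_ROOT_REG      ((volatile uintptr_t *)0xFFFF0008u)
    #define PFE_ARG0_REG      ((volatile uint32_t  *)0xFFFF0010u)

    enum { KERNEL_HASH_LOOKUP = 1 };

    /* Before running HashLookup, the CPU tells the PFE which traversal
     * kernel to run, the list head hash->array[j], and the search key.  */
    void start_hash_prefetch(void *list_head, int key) {
        *PFE_ROOT_REG      = (uintptr_t)list_head;
        *PFE_ARG0_REG      = (uint32_t)key;
        *PFE_KERNEL_ID_REG = KERNEL_HASH_LOOKUP;  /* assumed: this write starts the PFE */
    }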
21. Block Diagram of Specialized PFE
[Figure: specialized PFE components: Recurrent Load Table, Non-Recurrent Load Table, Ready Queue (pc, base, offset), Result Buffer (pc), Traversal-Info Table, Instruction Buffer, Root Register, Kernel Id Register, and TLB, connected to the cache/memory controller]
22. Block Diagram of Programmable PFE
[Figure: programmable PFE components: a processor core with a register file (including the root register) and stack, an instruction cache, Result Buffer, Kernel Id Register, Kernel Index Table, Instruction Buffer, and TLB, connected to the cache/memory controller; part of the structure is memory-mapped]
23. Issue 4: Redundant Prefetches
- Redundant prefetches
  - Tree traversals
[Figure: redundant prefetches during a tree traversal, shown across L1, L2, and main memory]
24. Issue 4: Redundant Prefetches
- Performance impact
  - Wasted bus bandwidth
  - Memory accesses are satisfied more slowly in the lower levels of the memory hierarchy
- Add a small data cache to the L2/memory PFEs (a sketch of the filtering effect follows)
[Figure: the PFE processor sends each request to its data cache; only misses are forwarded to the cache/memory controller, which returns the result]
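A hedged C sketch of the filtering effect of that small PFE data cache: before going to the cache/memory controller, the PFE checks its own data cache, so blocks it has already fetched (for example, revisited tree nodes) are served locally and no redundant request is issued. The cache geometry and the controller_fetch() hook are assumptions for illustration.

    /* Illustrative direct-mapped PFE data cache; geometry is an assumption. */
    #include <stdint.h>
    #include <string.h>

    #define PFE_DC_LINES  64
    #define PFE_DC_LINE_B 64

    struct pfe_dc_line { int valid; uintptr_t tag; uint8_t data[PFE_DC_LINE_B]; };
    static struct pfe_dc_line pfe_dc[PFE_DC_LINES];

    const void *controller_fetch(uintptr_t addr);   /* hypothetical controller hook */

    /* Returns the block on a hit; on a miss, fetches it through the
     * controller and fills the line so later accesses are filtered.   */
    const void *pfe_load(uintptr_t addr) {
        uintptr_t line = addr / PFE_DC_LINE_B;
        struct pfe_dc_line *e = &pfe_dc[line % PFE_DC_LINES];
        if (e->valid && e->tag == line)
            return e->data;                      /* redundant request filtered */
        memcpy(e->data, controller_fetch(addr), PFE_DC_LINE_B);
        e->valid = 1;
        e->tag   = line;
        return e->data;
    }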
25. Issue 5: Modifications to the Cache/Memory Controller
[Figure: a request buffer is added next to the MSHRs at the L2 and main-memory controllers; demand requests arriving over the L2 and memory buses merge with in-flight push requests in the request buffer (a software analogy follows)]
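A software analogy of that merge, as a sketch only (the real structure is hardware; the fields and sizes below are assumptions): when a demand miss arrives for a block that a push request has already brought in flight, the controller attaches the demand to the existing entry instead of issuing a second memory access.

    /* Software analogy of the controller's request buffer; illustrative only. */
    #include <stdint.h>

    #define RB_ENTRIES 16

    struct rb_entry {
        int       valid;
        uintptr_t block_addr;      /* address of the in-flight pushed block   */
        int       demand_waiting;  /* a CPU demand miss is attached to it     */
    };

    static struct rb_entry request_buffer[RB_ENTRIES];

    /* Returns 1 if the demand miss was merged with an outstanding push
     * request, 0 if a new memory access must be issued for it.          */
    int merge_demand_miss(uintptr_t block_addr) {
        for (int i = 0; i < RB_ENTRIES; i++) {
            if (request_buffer[i].valid &&
                request_buffer[i].block_addr == block_addr) {
                request_buffer[i].demand_waiting = 1;   /* piggyback on the push */
                return 1;
            }
        }
        return 0;
    }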
26. How to Avoid Early Prefetches?
[Figure: snapshots at times t1, t2, and t3 of a PFE traversing a 15-node binary tree ahead of the CPU]
27. How to Avoid Early Prefetches?
[Figure: the prefetch buffer's free bits throttle the PFE: when no entry is free the PFE suspends execution, and when the CPU consumes a block and sets its free bit the PFE continues execution]
28. How to Avoid Useless Prefetches?
[Figure: at t1, demand accesses that miss in L1/L2 trigger the memory PFE to start executing the traversal kernel]
29. How to Avoid Useless Prefetches?
[Figure: at t2, a later L1/L2 miss triggers the memory PFE's execution again, re-synchronizing it with the CPU's traversal]
30. Performance Prediction of the Push Architecture for Future Processors
31. Prefetch Coverage
32. Prefetch Distribution
33. Cumulative Distance between Recurrent Loads
34. Bandwidth Requirement
35. Effect of the PFE Data Cache and Throttle Mechanism
- The throttle mechanism has an impact on bh
- The PFE data cache has an impact on em3d, perimeter, and treeadd
36. Effect of the PFE Data Cache
[Figures: percentage of redundant prefetches captured in the PFE data cache; redundant prefetch distribution]
- em3d, perimeter, bh, and treeadd
  - 30% to 50% of prefetches are redundant
  - 70% to 100% of redundant prefetches are captured in the PFE data cache
37. PFE Architecture: Effect of Wider-Issue PFEs
- Increasing the issue width further improves performance, particularly for em3d and treeadd
38. TLB Miss Effect
- Hardware TLB miss handler, 30-cycle TLB miss penalty
39. PFE Architecture: Specialized vs. Programmable PFE
- A programmable PFE can achieve performance comparable to a specialized PFE
40. Breadth-First Tree Traversal
Traversal kernel:
    list = head;
    while (list) {
        node  = list->ptr;
        left  = node->left;
        right = node->right;
        list  = list->next;
    }
[Figure: a binary tree with nodes numbered 1 to 15 and a traversal work list from Head to Tail, currently holding nodes 8, 9, 10, 13, 14, and 15]
41. Push Architecture Design Issues
1. PFE architecture design
2. Interaction among PFEs
3. Synchronization between the CPU and PFEs
4. Redundant prefetches
5. Demands on the cache/memory controller
[Figure: CPU, L1, L2, and main memory, each with a PFE attached to its cache/memory controller]
42. Restore PFE State
[Figure: register file and PC trace: x issued (400988), x miss; 400990, 400950-400978; y issued (400998)]

    00400950  addiu $sp, $sp, -56        # save registers on the stack
    00400958  sw    $ra, 48($sp)
    00400960  sw    $s8, 44($sp)
    00400968  sw    $s0, 40($sp)
    00400970  addu  $s8, $zero, $sp
    00400978  addu  $s0, $zero, $a0
    00400980  beq   $s0, $zero, 004009a8
(x) 00400988  lw    $a0, 4($s0)          # miss
    00400990  jal   00400950 <K_TreeAdd>
(y) 00400998  lw    $a0, 8($s0)
    004009a0  jal   00400950 <K_TreeAdd>
    004009a8  addu  $sp, $zero, $s8      # restore registers from the stack
    004009b0  lw    $ra, 48($sp)
    004009b8  lw    $s8, 44($sp)
    004009c0  lw    $s0, 40($sp)
43. Restore PFE State
- Correct resume PC
  - Statically construct the resume PC table

    Recurrent Load PC    Resume PC
    400988               400998
    400998               4009a8
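A hedged C sketch of using that statically constructed table when restoring PFE state: given the PC of the recurrent load that missed, look up where the kernel should resume. The table contents mirror the two entries above; the lookup helper itself is an assumption.

    #include <stddef.h>
    #include <stdint.h>

    /* Statically constructed resume-PC table: for each recurrent load PC,
     * the PC at which the kernel continues once the missing node arrives. */
    struct resume_entry { uint32_t recurrent_load_pc; uint32_t resume_pc; };

    static const struct resume_entry resume_table[] = {
        { 0x400988, 0x400998 },
        { 0x400998, 0x4009a8 },
    };

    uint32_t lookup_resume_pc(uint32_t miss_pc) {
        for (size_t i = 0; i < sizeof resume_table / sizeof resume_table[0]; i++)
            if (resume_table[i].recurrent_load_pc == miss_pc)
                return resume_table[i].resume_pc;
        return 0;   /* not a recurrent load: no special resume point */
    }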