Title: A Programmable Memory Hierarchy for Prefetching Linked Data Structures
1. A Programmable Memory Hierarchy for Prefetching Linked Data Structures
- Chia-Lin Yang, Department of Computer Science and Information Engineering, National Taiwan University
- Alvin R. Lebeck, Department of Computer Science, Duke University
2. Memory Wall
- The processor-memory gap grows over time
- Prefetching
  - What? Future address prediction
  - When? Prefetch schedule
[Figure: the processor-memory gap, with CPU performance improving ~60%/year and DRAM performance ~10%/year]
3. Prefetch Linked Data Structures (LDS)
    p = head;
    while (p) {
        work(p->data);
        p = p->next;
    }
- Linked data structures
  - No regularity in the address stream
  - Adjacent elements are not necessarily contiguous in memory
- The pointer-chasing problem
[Figure: a linked list in which node p is currently being visited and a node several links ahead is the one we would like to prefetch]
    while (p) {
        prefetch(p->next->next->next);
        work(p->data);
        p = p->next;
    }
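A minimal C sketch of the two loops above (the node layout and the work() helper are assumptions added for illustration). It makes the pointer-chasing problem concrete: forming the prefetch address p->next->next->next requires dereferencing two intermediate nodes that are themselves likely cache misses, so the prefetch cannot run ahead of the traversal.

    /* Sketch only: node layout and work() are illustrative, not from the slides. */
    typedef struct node {
        int          data;
        struct node *next;
    } node_t;

    void work(int data);    /* application work on each node (assumed) */

    void traverse(node_t *head) {
        for (node_t *p = head; p != NULL; p = p->next)
            work(p->data);                   /* each p->next load may miss      */
    }

    void traverse_prefetch(node_t *head) {
        for (node_t *p = head; p != NULL; p = p->next) {
            if (p->next && p->next->next)                /* two dependent loads  */
                __builtin_prefetch(p->next->next->next); /* just to compute the  */
            work(p->data);                               /* prefetch address     */
        }
    }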
4. The Push Architecture
- An LDS prefetching framework built on a novel data movement model: push (Yang 2000)
[Figure: the traditional pull model vs. the new push model]
5. Outline
- Background and Motivation
- What is the Push Architecture?
- Design of the Push Architecture
- Variations of the Push Architecture
- Experimental Results
- Related Research
- Conclusion
6. Block Diagram of the Push Architecture
[Figure: a prefetch engine at each level of the memory hierarchy (L1 with a prefetch buffer, L2, and main memory), each issuing prefetch requests across the L2 and memory buses]
7. How to Predict Future Addresses?
- LDS traversal kernels
  - Load instructions in LDS traversal kernels are a compact representation of LDS accesses (Roth98)
- PFEs execute LDS traversal kernels independently of the CPU
- The amount of computation between node accesses affects how far the PFE can run ahead of the CPU
    while (list != NULL) {
        p = list->x;
        process(p->data);
        list = list->next;   /* recurrent load */
    }
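As a rough illustration of what a PFE (for example at the L2 or memory level) executes, here is a hedged C sketch of the same traversal kernel recast for the PFE; push_block_to_upper_level() is a hypothetical stand-in for the hardware push operation, not an interface defined in the talk.

    /* Hypothetical sketch: the PFE walks the list via the recurrent load and
     * pushes each touched block toward L1 instead of waiting to be asked.    */
    struct data_item;                                  /* payload type (assumed) */
    void push_block_to_upper_level(const void *addr);  /* hypothetical push     */

    typedef struct list_node {
        struct data_item *x;
        struct list_node *next;
    } list_node_t;

    void pfe_kernel(list_node_t *list) {
        while (list != NULL) {
            push_block_to_upper_level(list);      /* push the node itself        */
            push_block_to_upper_level(list->x);   /* push the data it points to  */
            list = list->next;                    /* recurrent load              */
        }
    }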
8. The Pointer-Chasing Problem: how does the push model help?
- Push model: a pipelined process
[Figure: the memory-level PFE traverses successive nodes (1, 2, 3, 4) and pushes each one up through L2 to L1, so the node accesses overlap in a pipeline]
9. Push Architecture Design Issues
1. PFE architecture design
2. Interaction among PFEs
3. Synchronization between the CPU and PFEs
4. Redundant prefetches
5. Demands on the cache/memory controller
[Figure: CPU, L1, L2, and main memory, each with a PFE attached to its cache/memory controller]
10. Issue 1: PFE Architecture
- Programmable PFE
  - General-purpose processor core
  - 5-stage pipeline, in-order processor
  - Integer ALU units for address calculation and control flow
  - TLB for address translation
  - Root register to store the root address of the LDS being traversed
11. Issue 2: Interaction among PFEs
Traversal kernel:
    Tree(root);
    Tree(node) {
        if (node) {
            Tree(node->left);
            Tree(node->right);
        }
    }
[Figure: the CPU and the PFEs at L1, L2, and memory each hold a root register; "resume" arrows show the traversal being handed from one PFE to the next, with the root register supplying the node at which to continue]
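A hedged C sketch of how each PFE might run the recursive kernel above; read_root_register() and push_block_to_upper_level() are hypothetical accessors standing in for the root register and the hardware push, and the hand-off itself is only modeled, not implemented.

    typedef struct tree_node {
        struct tree_node *left, *right;
    } tree_node_t;

    tree_node_t *read_root_register(void);              /* hypothetical */
    void push_block_to_upper_level(const void *addr);   /* hypothetical */

    static void tree_kernel(tree_node_t *node) {
        if (node) {
            push_block_to_upper_level(node);   /* push this node upward */
            tree_kernel(node->left);
            tree_kernel(node->right);
        }
    }

    void pfe_resume(void) {
        /* The CPU, or the PFE one level up, writes the node at which to
         * continue into this PFE's root register before resuming it.    */
        tree_kernel(read_root_register());
    }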
12. Issue 3: Synchronization between the CPU and PFEs
- When do we need to synchronize CPU and PFE execution?
  - Early prefetches: the PFEs are running too far ahead of the CPU
  - Useless prefetches: the PFEs are traversing down the wrong path, or are running behind the CPU
- Throttle mechanism (a software analogy is sketched below)
[Figure: the prefetch buffer keeps a free bit per cache block; the PFE produces into free entries and the CPU consumes them]
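A software analogy of the free-bit throttle, as a sketch only (the real mechanism is hardware in the prefetch buffer; the buffer size and helper names below are assumptions): the PFE may only write into an entry whose free bit is set, and the CPU sets the bit back when it consumes the block, so a PFE that runs too far ahead simply stalls.

    /* Software analogy of the hardware free-bit throttle; illustrative only.
     * In the analogy, the PFE and the CPU are separate agents.              */
    #define PB_ENTRIES 32

    struct pb_entry {
        int   free_bit;           /* 1: entry may be overwritten by the PFE */
        void *block;              /* prefetched cache block                 */
    };

    static struct pb_entry prefetch_buffer[PB_ENTRIES];

    /* PFE side: wait until the entry is free, then produce into it. */
    void pfe_produce(int i, void *block) {
        while (!prefetch_buffer[i].free_bit)
            ;                                  /* PFE suspends: no free entry */
        prefetch_buffer[i].block    = block;
        prefetch_buffer[i].free_bit = 0;
    }

    /* CPU side: consuming a block frees the entry and un-stalls the PFE. */
    void *cpu_consume(int i) {
        void *block = prefetch_buffer[i].block;
        prefetch_buffer[i].free_bit = 1;
        return block;
    }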
13. Variations of the Push Architecture
[Figure: three configurations, 3_PFE, 2_PFE, and 1_PFE, with three, two, and one prefetch engines respectively, each pushing data up the hierarchy]
- 2_PFE should perform comparably to 3_PFE
- 1_PFE performs well if most of the LDS exists only in main memory
14. Outline
- Background and Motivation
- What is the Push Architecture?
- Design of the Push Architecture
- Variations of the Push Architecture
- Experimental Results
- Related Research
- Conclusion
15. Experimental Setup
- SimpleScalar out-of-order processor
- Benchmarks
  - Olden benchmark suite, rayshade
- Baseline processor
  - 4-way issue, 64-entry RUU, 16-entry LSQ
  - Lockup-free caches with 8 outstanding misses
  - 32KB, 32B-line, 2-way L1; 512KB, 64B-line, 4-way L2
  - 84-cycle round-trip memory latency; 48-cycle DRAM access time
- Prefetch models
  - Push model: PFEs at 3 levels, 32-entry fully-associative prefetch buffer
  - Pull model: L1-level PFE, 32-entry fully-associative prefetch buffer
16. Performance Comparison: Push vs. Pull
- health, mst, perimeter, and treeadd
  - Push: 4% to 25% speedup; Pull: 0% to 4% speedup
- em3d, rayshade
  - Push: 31% to 57% speedup; Pull: 25% to 39% speedup
- bh
  - Push: 33% speedup; Pull: 33% speedup
- Dynamically changing structures: bisort and tsp
17. Variations of the Push Architecture
- 2_PFE performs comparably to 3_PFE
- 1_PFE performs comparably to 3_PFE except for em3d
18. Related Work
- Prefetching for irregular applications
  - Correlation-based prefetch (Joseph97, Alexander96)
  - Compiler-based prefetch (Luk96)
  - Dependence-based prefetch (Roth98)
  - Jump-pointer prefetch (Roth99)
- Decoupled architectures
  - Decoupled access/execute (Smith82)
  - Pre-execution (Annavaram2001, Collins2001, Roth2001, Zilles2001, Luk2001)
- Processor-in-memory
  - Berkeley IRAM group (Patterson97)
  - Active Pages (Oskin98)
  - FlexRAM (Kang99)
  - Impulse (Carter99)
  - Memory-side prefetching (Hughes2000)
19. Conclusion
- We built a general architectural solution for the push model
- The push model is effective in reducing the impact of the pointer-chasing problem on prefetching performance
  - Applications with tight traversal loops: Push 4% to 25%, Pull 0% to 4%
  - Applications with longer computation between node accesses: Push 31% to 57%, Pull 25% to 39%
- 2_PFE performs comparably to 3_PFE
20. Traversal Kernel
CPU code:
    void *HashLookup(int key, Hash hash) {
        j = (hash->mapfunc)(key);
        for (ent = hash->array[j]; ent && ent->key != key; ent = ent->next);
        if (ent) return ent->entry;
        return NULL;
    }

Passed to the PFE through the memory-mapped interface:
- traversal kernel identifier
- hash->array[j]
- key

PFE traversal kernel:
    void kernel(HashEntry ent, int key) {
        for (; ent && ent->key != key; ent = ent->next);
    }
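A hedged sketch of how the CPU could hand these three values to the PFE through memory-mapped registers; the register addresses, names, and the convention that writing the kernel id starts the PFE are all assumptions for illustration, not the encoding used in the talk.

    /* Illustrative only: register addresses and layout are invented. */
    #include <stdint.h>

    #define PFE_KERNEL_ID_REG ((volatile uint32_t  *)0xFFFF0000u)
    #define PFE_ROOT_REG      ((volatile uintptr_t *)0xFFFF0008u)
    #define PFE_ARG0_REG      ((volatile uint32_t  *)0xFFFF0010u)

    enum { KERNEL_HASH_LOOKUP = 1 };

    /* Before running HashLookup, the CPU tells the PFE which traversal
     * kernel to run, the list head hash->array[j], and the search key.  */
    void start_hash_prefetch(void *list_head, int key) {
        *PFE_ROOT_REG      = (uintptr_t)list_head;
        *PFE_ARG0_REG      = (uint32_t)key;
        *PFE_KERNEL_ID_REG = KERNEL_HASH_LOOKUP;  /* assumed: this write starts the PFE */
    }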
21. Block Diagram of Specialized PFE
[Figure: specialized PFE components: Recurrent Load Table, Non-Recurrent Load Table, Ready Queue (pc, base, offset), Result Buffer (pc), Traversal-Info Table, Instruction Buffer, Root Register, Kernel Id Register, and TLB, connected to the cache/memory controller]
22. Block Diagram of Programmable PFE
[Figure: programmable PFE components: a processor core with a register file (including the root register) and stack, an instruction cache, Result Buffer, Kernel Id Register, Kernel Index Table, Instruction Buffer, and TLB, connected to the cache/memory controller; part of the structure is memory-mapped]
23. Issue 4: Redundant Prefetches
- Redundant prefetches
  - Tree traversals
[Figure: redundant prefetches during a tree traversal, shown across L1, L2, and main memory]
24. Issue 4: Redundant Prefetches
- Performance impact
  - Wasted bus bandwidth
  - Memory accesses are satisfied more slowly in the lower levels of the memory hierarchy
- Add a small data cache to the L2/memory PFEs (a sketch of the filtering effect follows)
[Figure: the PFE processor sends each request to its data cache; only misses are forwarded to the cache/memory controller, which returns the result]
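A hedged C sketch of the filtering effect of that small PFE data cache: before going to the cache/memory controller, the PFE checks its own data cache, so blocks it has already fetched (for example, revisited tree nodes) are served locally and no redundant request is issued. The cache geometry and the controller_fetch() hook are assumptions for illustration.

    /* Illustrative direct-mapped PFE data cache; geometry is an assumption. */
    #include <stdint.h>
    #include <string.h>

    #define PFE_DC_LINES  64
    #define PFE_DC_LINE_B 64

    struct pfe_dc_line { int valid; uintptr_t tag; uint8_t data[PFE_DC_LINE_B]; };
    static struct pfe_dc_line pfe_dc[PFE_DC_LINES];

    const void *controller_fetch(uintptr_t addr);   /* hypothetical controller hook */

    /* Returns the block on a hit; on a miss, fetches it through the
     * controller and fills the line so later accesses are filtered.   */
    const void *pfe_load(uintptr_t addr) {
        uintptr_t line = addr / PFE_DC_LINE_B;
        struct pfe_dc_line *e = &pfe_dc[line % PFE_DC_LINES];
        if (e->valid && e->tag == line)
            return e->data;                      /* redundant request filtered */
        memcpy(e->data, controller_fetch(addr), PFE_DC_LINE_B);
        e->valid = 1;
        e->tag   = line;
        return e->data;
    }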
25. Issue 5: Modifications to the Cache/Memory Controller
[Figure: a request buffer is added next to the MSHRs at the L2 and main-memory controllers; demand requests arriving over the L2 and memory buses merge with in-flight push requests in the request buffer (a software analogy follows)]
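A software analogy of that merge, as a sketch only (the real structure is hardware; the fields and sizes below are assumptions): when a demand miss arrives for a block that a push request has already brought in flight, the controller attaches the demand to the existing entry instead of issuing a second memory access.

    /* Software analogy of the controller's request buffer; illustrative only. */
    #include <stdint.h>

    #define RB_ENTRIES 16

    struct rb_entry {
        int       valid;
        uintptr_t block_addr;      /* address of the in-flight pushed block   */
        int       demand_waiting;  /* a CPU demand miss is attached to it     */
    };

    static struct rb_entry request_buffer[RB_ENTRIES];

    /* Returns 1 if the demand miss was merged with an outstanding push
     * request, 0 if a new memory access must be issued for it.          */
    int merge_demand_miss(uintptr_t block_addr) {
        for (int i = 0; i < RB_ENTRIES; i++) {
            if (request_buffer[i].valid &&
                request_buffer[i].block_addr == block_addr) {
                request_buffer[i].demand_waiting = 1;   /* piggyback on the push */
                return 1;
            }
        }
        return 0;
    }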
26. How to Avoid Early Prefetches?
[Figure: snapshots at times t1, t2, and t3 of a PFE traversing a 15-node binary tree ahead of the CPU]
27. How to Avoid Early Prefetches?
[Figure: the prefetch buffer's free bits throttle the PFE: when no entry is free the PFE suspends execution, and when the CPU consumes a block and sets its free bit the PFE continues execution]
28. How to Avoid Useless Prefetches?
[Figure: at t1, demand accesses that miss in L1/L2 trigger the memory PFE to start executing the traversal kernel]
29. How to Avoid Useless Prefetches?
[Figure: at t2, a later L1/L2 miss triggers the memory PFE's execution again, re-synchronizing it with the CPU's traversal]
30. Performance Prediction of the Push Architecture for Future Processors
31. Prefetch Coverage
32. Prefetch Distribution
33. Cumulative Distance between Recurrent Loads
34. Bandwidth Requirement
35. Effect of the PFE Data Cache and Throttle Mechanism
- The throttle mechanism has an impact on bh
- The PFE data cache has an impact on em3d, perimeter, and treeadd
36. Effect of the PFE Data Cache
[Figures: percentage of redundant prefetches captured in the PFE data cache; redundant prefetch distribution]
- em3d, perimeter, bh, and treeadd
  - 30% to 50% of prefetches are redundant
  - 70% to 100% of redundant prefetches are captured in the PFE data cache
37. PFE Architecture: Effect of Wider-Issue PFEs
- Increasing the issue width further improves performance, particularly for em3d and treeadd
38. TLB Miss Effect
- Hardware TLB miss handler, 30-cycle TLB miss penalty
39. PFE Architecture: Specialized vs. Programmable PFE
- A programmable PFE can achieve performance comparable to a specialized PFE
40. Breadth-First Tree Traversal
Traversal kernel:
    list = head;
    while (list) {
        node  = list->ptr;
        left  = node->left;
        right = node->right;
        list  = list->next;
    }
[Figure: a binary tree with nodes numbered 1 to 15 and a traversal work list from Head to Tail, currently holding nodes 8, 9, 10, 13, 14, and 15]
41. Push Architecture Design Issues
1. PFE architecture design
2. Interaction among PFEs
3. Synchronization between the CPU and PFEs
4. Redundant prefetches
5. Demands on the cache/memory controller
[Figure: CPU, L1, L2, and main memory, each with a PFE attached to its cache/memory controller]
42. Restore PFE State
[Figure: register file and PC trace: x issued (400988), x miss; 400990, 400950-400978; y issued (400998)]

    00400950  addiu $sp, $sp, -56        # save registers on the stack
    00400958  sw    $ra, 48($sp)
    00400960  sw    $s8, 44($sp)
    00400968  sw    $s0, 40($sp)
    00400970  addu  $s8, $zero, $sp
    00400978  addu  $s0, $zero, $a0
    00400980  beq   $s0, $zero, 004009a8
(x) 00400988  lw    $a0, 4($s0)          # miss
    00400990  jal   00400950 <K_TreeAdd>
(y) 00400998  lw    $a0, 8($s0)
    004009a0  jal   00400950 <K_TreeAdd>
    004009a8  addu  $sp, $zero, $s8      # restore registers from the stack
    004009b0  lw    $ra, 48($sp)
    004009b8  lw    $s8, 44($sp)
    004009c0  lw    $s0, 40($sp)
43. Restore PFE State
- Correct resume PC
  - Statically construct the resume PC table

    Recurrent Load PC    Resume PC
    400988               400998
    400998               4009a8
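A hedged C sketch of using that statically constructed table when restoring PFE state: given the PC of the recurrent load that missed, look up where the kernel should resume. The table contents mirror the two entries above; the lookup helper itself is an assumption.

    #include <stddef.h>
    #include <stdint.h>

    /* Statically constructed resume-PC table: for each recurrent load PC,
     * the PC at which the kernel continues once the missing node arrives. */
    struct resume_entry { uint32_t recurrent_load_pc; uint32_t resume_pc; };

    static const struct resume_entry resume_table[] = {
        { 0x400988, 0x400998 },
        { 0x400998, 0x4009a8 },
    };

    uint32_t lookup_resume_pc(uint32_t miss_pc) {
        for (size_t i = 0; i < sizeof resume_table / sizeof resume_table[0]; i++)
            if (resume_table[i].recurrent_load_pc == miss_pc)
                return resume_table[i].resume_pc;
        return 0;   /* not a recurrent load: no special resume point */
    }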