A Programmable Memory Hierarchy for Prefetching Linked Data Structures
1
A Programmable Memory Hierarchy for Prefetching
Linked Data Structures
  • Chia-Lin Yang
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University

  • Alvin R. Lebeck
  • Department of Computer Science
  • Duke University
2
Memory Wall
  • Processor-memory gap grows over time
  • Prefetching
  • What? Future Address Prediction
  • When? Prefetch Schedule

[Figure: Processor-Memory Gap: CPU performance improves ~60%/year while DRAM performance improves ~10%/year]
3
Prefetch Linked Data Structures (LDS)
p = head;
while (p) {
    work(p->data);
    p = p->next;
}
  • Linked data structures
  • No regularity in the address stream
  • Adjacent elements are not necessarily contiguous
    in memory
  • Pointer-chasing problem

[Figure: list nodes scattered in memory; the node currently being visited vs. the node we would like to prefetch]

while (p) {
    prefetch(p->next->next->next);
    work(p->data);
    p = p->next;
}
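A runnable C rendering of this loop, as a minimal sketch: the slide's prefetch is abstract, so GCC/Clang's __builtin_prefetch stands in for it, and the null checks are added so the three-ahead pointer chase cannot fault:

    #include <stddef.h>

    typedef struct node {
        int data;
        struct node *next;
    } node_t;

    void work(int data);  /* application work function, assumed */

    void traverse(node_t *head)
    {
        for (node_t *p = head; p != NULL; p = p->next) {
            /* Prefetch three nodes ahead. Note that computing the
             * address p->next->next->next itself requires dereferencing
             * two earlier nodes: this is the pointer-chasing problem. */
            if (p->next && p->next->next)
                __builtin_prefetch(p->next->next->next);
            work(p->data);
        }
    }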
4
The Push Architecture
  • An LDS prefetching framework built on a novel data
    movement model: Push (Yang2000)

[Figure: traditional pull model vs. new push model for moving data up the memory hierarchy]
5
Outline
  • Background & Motivation
  • What is the Push Architecture?
  • Design of the Push Architecture
  • Variations of the Push Architecture
  • Experimental Results
  • Related Research
  • Conclusion

6
Block Diagram of the Push Architecture
[Figure: block diagram: a Prefetch Engine (PFE) sits at each level (L1 with a prefetch buffer, L2, and main memory), each issuing prefetch requests; prefetched blocks travel up the L2 bus and memory bus]
7
How to Predict Future Addresses?
  • LDS traversal kernels
  • Load instructions in LDS traversal kernels are a
    compact representation of LDS accesses (Roth98)
  • PFEs execute LDS traversal kernels independently of
    the CPU
  • The amount of computation between node accesses
    affects how far the PFE can run ahead of the
    CPU

while (list != NULL) {
    p = list->x;
    process(p->data);
    list = list->next;    /* recurrent load */
}
8
The Pointer-Chasing Problem: How Does the Push
Model Help?
  • Push model: a pipelined process

[Figure: successive nodes (1-4) are pushed up from the memory-level PFE through L2 and L1, overlapping node fetches instead of paying a full round trip per node]
9
Push Architecture Design Issues
1. PFE Architecture Design
2. Interaction among PFEs
3. Synchronization between CPU and PFEs
4. Redundant Prefetches
5. Demands on the cache/memory controller

[Figure: CPU, L1, L2, and main memory, with a PFE and controller attached at each level]
10
Issue 1: PFE Architecture
  • Programmable PFE
  • General purpose processor core
  • 5 stage pipeline, in-order processor
  • Integer ALU units for address calculation and
    control flow
  • TLB for address translation
  • Root register to store the root address of the
    LDS being traversed
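As a concrete reading of these bullets, here is a small C sketch of the state a programmable PFE would hold; the field names, register count, and TLB size are illustrative assumptions rather than the paper's exact design:

    #include <stdint.h>

    typedef struct {
        uint32_t vpn, pfn;         /* virtual/physical page numbers */
        int      valid;
    } tlb_entry_t;

    typedef struct {
        uint32_t    pc;            /* 5-stage, in-order pipeline */
        uint32_t    regs[32];      /* integer regs for address calculation and control flow */
        uint32_t    root_reg;      /* root address of the LDS being traversed */
        tlb_entry_t tlb[16];       /* address translation (size assumed) */
    } pfe_state_t;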

11
Issue 2: Interaction among PFEs

Tree(root);

Tree(node) {
    if (node) {
        Tree(node->left);
        Tree(node->right);
    }
}

[Figure: the CPU and the L1, L2, and memory PFEs each hold a root register; "resume" signals pass the traversal down from one PFE to the next]
12
Issue 3: Synchronization between CPU and PFEs
  • When do we need to synchronize the CPU and PFE
    execution?
  • Early prefetches
  • the PFEs are running too far ahead of the CPU
  • Useless prefetches
  • the PFEs are traversing down the wrong path
  • the PFEs are running behind the CPU
  • Throttle mechanism

[Figure: the PFE produces prefetched cache blocks into the prefetch buffer and the CPU consumes them; a free bit per block (e.g. 1 0 0 1) tracks which blocks are still unconsumed]
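A minimal C sketch of the free-bit throttle the figure suggests; the 32-entry size matches the setup slide, while the function names and suspend policy are assumptions:

    #include <stdbool.h>

    #define PB_ENTRIES 32  /* 32-entry prefetch buffer (setup slide) */

    typedef struct {
        bool free_bit[PB_ENTRIES];  /* 1 = pushed block not yet consumed */
    } prefetch_buffer_t;

    /* PFE side: push a block only if its slot has been consumed;
     * otherwise the PFE is running too far ahead and suspends. */
    bool pfe_try_produce(prefetch_buffer_t *pb, int slot)
    {
        if (pb->free_bit[slot])
            return false;            /* unconsumed: suspend execution */
        pb->free_bit[slot] = true;   /* produce the block */
        return true;
    }

    /* CPU side: consuming a block frees its slot, letting the PFE resume. */
    void cpu_consume(prefetch_buffer_t *pb, int slot)
    {
        pb->free_bit[slot] = false;
    }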
13
Variations of the Push Architecture
[Figure: 3_PFE attaches a PFE at L1, L2, and memory; 2_PFE at L2 and memory; 1_PFE at memory only; all push data up the hierarchy]
  • 2_PFE should perform comparably to 3_PFE
  • 1_PFE performs well if most of the LDS exists only
    in main memory

14
Outline
  • Background & Motivation
  • What is the Push Architecture?
  • Design of the Push Architecture
  • Variations of the Push Architecture
  • Experimental Results
  • Related Research
  • Conclusion

15
Experimental Setup
  • SimpleScalar out-of-order processor
  • Benchmark
  • Olden benchmark suite + rayshade
  • Baseline processor
  • 4-way issue, 64 RUU, 16 LSQ
  • lockup-free caches with 8 outstanding misses
  • 32KB, 32B line, 2-way L1; 512KB, 64B line, 4-way
    L2
  • 84-cycle round-trip memory latency (48-cycle
    DRAM access time)
  • Prefetch model
  • Push model: 3 levels of PFEs, 32-entry
    fully-associative prefetch buffer
  • Pull model: L1-level PFE, 32-entry
    fully-associative prefetch buffer

16
Performance Comparison Push vs. Pull
  • health, mst, perimeter and treeadd
  • Push: 4% to 25% speedup; Pull: 0% to 4% speedup
  • em3d, rayshade
  • Push: 31% to 57% speedup; Pull: 25% to 39%
    speedup
  • bh
  • Push: 33% speedup; Pull: 33% speedup
  • Dynamically changing structures: bisort and tsp

17
Variations of the Push Architecture
  • 2_PFE performs comparably to 3_PFE
  • 1_PFE performs comparably to 3_PFE except for
    em3d.

18
Related Work
  • Prefetching for Irregular Applications
  • Correlation-based prefetching (Joseph97,
    Alexander96)
  • Compiler-based prefetching (Luk96)
  • Dependence-based prefetching (Roth98)
  • Jump-pointer prefetching (Roth99)
  • Decoupled Architecture
  • Decoupled Access Execute (Smith82)
  • Pre-execution (Annavaram2001, Collins2001,
    Roth2001, Zilles2001, Luk2001)
  • Processor-in-Memory
  • Berkeley IRAM Group (Patterson97)
  • Active Page (Oskin98)
  • FlexRAM (Kang99)
  • Impulse (Carter99)
  • Memory-side prefetching (Hughes2000)

19
Conclusion
  • Build a general architectural solution for the
    push model
  • The push model is effective in reducing the
    impact of the pointer-chasing problem on
    prefetching performance
  • applications with tight traversal loops
  • Push: 4% to 25%; Pull: 0% to 4%
  • applications with longer computation between node
    accesses
  • Push: 31% to 57%; Pull: 25% to 39%
  • 2_PFE performs comparably to 3_PFE.

20
Traversal Kernel
void *HashLookup(int key, Hash hash) {
    int j = (hash->mapfunc)(key);
    HashEntry ent;
    for (ent = hash->array[j]; ent && ent->key != key; ent = ent->next) ;
    if (ent) return ent->entry;
    return NULL;
}
CPU
  1. traversal kernel identifier
  2. hash->array[j]
  3. key

memory-mapped interface
void kernel(HashEntry ent, int key) {
    for ( ; ent && ent->key != key; ent = ent->next) ;
}
PFE
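From the CPU's side, the memory-mapped hand-off could look roughly like this C sketch; the register layout, the MMIO base address, and the launch-on-write behavior are all assumptions for illustration:

    #include <stdint.h>

    typedef struct {
        volatile uint32_t kernel_id;  /* 1. traversal kernel identifier */
        volatile uint32_t root;       /* 2. starting node, e.g. hash->array[j] */
        volatile uint32_t arg;        /* 3. key */
    } pfe_regs_t;

    #define PFE_REGS ((pfe_regs_t *)0xFFFF0000)  /* MMIO base address, assumed */

    void pfe_start_hash_lookup(uint32_t kernel_id, void *start, uint32_t key)
    {
        PFE_REGS->root = (uint32_t)(uintptr_t)start;
        PFE_REGS->arg  = key;
        PFE_REGS->kernel_id = kernel_id;  /* assume writing the id launches the kernel */
    }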
21
Block Diagram of Specialized PFE
[Figure: specialized PFE block diagram: recurrent load table, ready queue (pc, base, offset), root register, non-recurrent load table, kernel id register, TLB, result buffer (pc), instruction buffer, and traversal-info table, attached to the cache/memory controller]
22
Block Diagram of Programmable PFE
[Figure: programmable PFE block diagram: processor core with register file (including the root register) and stack, instruction cache, result buffer, kernel id register, TLB, instruction buffer, and kernel index table (a memory-mapped structure), attached to the cache/memory controller]
23
Issue 4: Redundant Prefetches
  • Redundant prefetches
  • Tree traversals

[Figure: a tree traversal spanning L1, L2, and main memory; nodes already present higher in the hierarchy get prefetched again]
24
Issue 4: Redundant Prefetches
  • Performance impact
  • Waste bus bandwidth
  • Memory accesses are satisfied more slowly in the
    lower levels of the memory hierarchy
  • Add a small data cache in the L2/Memory PFEs

[Figure: the PFE processor's requests go first to a small data cache; only misses are forwarded to the cache/memory controller, whose results fill the data cache]
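The request path in the figure can be sketched in C as below; the direct-mapped organization, the sizes, and the controller_fetch helper are assumptions, but the flow matches the figure (check the small cache first, forward only misses):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define DC_LINES   64          /* small PFE data cache, size assumed */
    #define LINE_BYTES 64          /* 64B lines, matching the L2 */

    typedef struct {
        bool     valid[DC_LINES];
        uint64_t tag[DC_LINES];
        uint8_t  data[DC_LINES][LINE_BYTES];
    } pfe_dcache_t;

    uint8_t *controller_fetch(uint64_t addr);  /* miss request to the controller (assumed) */

    /* Redundant accesses (e.g. revisited tree nodes) hit here and are
     * satisfied locally instead of traveling down the hierarchy again. */
    uint8_t *pfe_load(pfe_dcache_t *dc, uint64_t addr)
    {
        uint64_t line = addr / LINE_BYTES;
        unsigned idx  = line % DC_LINES;
        if (dc->valid[idx] && dc->tag[idx] == line)
            return dc->data[idx];                  /* hit: no miss request */
        memcpy(dc->data[idx], controller_fetch(addr), LINE_BYTES);
        dc->valid[idx] = true;
        dc->tag[idx]   = line;
        return dc->data[idx];
    }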
25
Issue 5: Modifications to the Cache/Memory Controller
[Figure: a request buffer sits beside the MSHRs at the L2 and memory levels; demand requests merge with matching in-flight prefetch requests on the L2 and memory buses]
26
How to Avoid Early Prefetches?
[Figure: snapshots at t1, t2, and t3 of a binary-tree traversal (nodes 1-15): the PFE runs several nodes ahead of the CPU, so pushed blocks can arrive before the CPU is ready to consume them]
27
How to Avoid Early Prefetches?
[Figure: the same traversal at t1 and t3 with free bits: at t1 all free bits are 0 and the PFE continues execution; at t3 a block's free bit is still 1 (unconsumed), so the PFE suspends execution until the CPU catches up]
28
How to Avoid Useless Prefetches?
[Figure: at t1 the CPU traverses nodes 1-6; accesses that miss in L1/L2 trigger execution of the memory-level PFE, while L1 hits do not]
29
How to Avoid Useless Prefetches?
[Figure: at t2 a later L1/L2 miss triggers the memory-level PFE again, restarting the traversal from the missed node so the PFE realigns with the CPU's path]
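A heavily simplified C sketch of this miss-triggered resynchronization; every name here is an assumption, and the grounded idea is only that a demand miss restarts the memory-level PFE's traversal kernel at the missed node:

    #include <stdint.h>

    typedef struct {
        uint32_t pc;        /* PFE program counter */
        uint32_t root_reg;  /* node the traversal (re)starts from */
        int      running;
    } mem_pfe_t;

    uint32_t kernel_entry_pc(void);  /* entry PC of the loaded kernel (assumed) */

    /* Called when a CPU demand miss reaches main memory. */
    void on_demand_miss(mem_pfe_t *pfe, uint32_t miss_addr)
    {
        pfe->root_reg = miss_addr;        /* realign with the CPU's actual path */
        pfe->pc       = kernel_entry_pc();
        pfe->running  = 1;                /* trigger execution */
    }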
30
Performance Prediction of the Push Architecture
for Future Processors
31
Prefetch Coverage
32
Prefetch Distribution
33
Cumulative Distance between Recurrent Loads
34
Bandwidth Requirement
35
Effect of the PFE Data Cache & Throttle Mechanism
  • The throttle mechanism has an impact on bh.
  • The PFE data cache has an impact on em3d, perimeter
    and treeadd.

36
Effect of the PFE Data Cache
[Chart: % of redundant prefetches captured in the PFE data
cache; redundant prefetch distribution]
  • em3d, perimeter, bh and treeadd
  • 30% to 50% of prefetches are redundant
  • 70% to 100% of redundant prefetches are captured
    in the PFE data cache

37
PFE Architecture: Effect of Wider-Issue PFEs
  • Increasing issue width further improves performance,
    particularly for em3d and treeadd

38
TLB Miss Effect
  • Hardware TLB miss handler, 30-cycle TLB miss
    penalty

39
PFE Architecture: Specialized vs. Programmable PFE
  • A programmable PFE can achieve performance
    comparable to a specialized PFE

40
Breadth-First Tree Traversal
Traversal Kernel:
    list = head;
    while (list) {
        node  = list->ptr;
        left  = node->left;
        right = node->right;
        list  = list->next;
    }

[Figure: a binary tree (nodes 1-15) traversed breadth-first; the work list runs from Head to Tail and holds the current frontier (nodes 8, 9, 10, 13, 14, 15)]
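For reference, a runnable C version of the kernel, under the assumption (suggested by the Head/Tail work list in the figure) that each visited node's children are appended at the tail:

    #include <stdlib.h>

    typedef struct tree { struct tree *left, *right; } tree_t;
    typedef struct work { tree_t *ptr; struct work *next; } work_t;

    static work_t *enqueue(work_t *tail, tree_t *node)
    {
        work_t *w = malloc(sizeof *w);
        w->ptr  = node;
        w->next = NULL;
        tail->next = w;
        return w;
    }

    /* Walk the work list front to back, appending children at the tail,
     * so nodes are visited level by level. (Freeing the work entries is
     * elided in this sketch.) */
    void bfs_traverse(tree_t *root)
    {
        if (!root) return;
        work_t head = { root, NULL };
        work_t *tail = &head;
        for (work_t *list = &head; list != NULL; list = list->next) {
            tree_t *node = list->ptr;
            if (node->left)  tail = enqueue(tail, node->left);
            if (node->right) tail = enqueue(tail, node->right);
        }
    }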

41
Push Architecture Design Issues
1. PFE Architecture Design
2. Interaction among PFEs
3. Synchronization between CPU and PFEs
4. Redundant Prefetches
5. Demands on the cache/memory controller

[Figure: CPU, L1, L2, and main memory, with a PFE and controller attached at each level]
42
Restore PFE State
Register File / PC at the point of the miss: x (the load at
400988) issued and missed; 00400990 and 00400950-00400978
executed; y (the load at 400998) issued.

00400950  addiu $sp,$sp,-56       # save registers on the stack
00400958  sw    $ra,48($sp)
00400960  sw    $s8,44($sp)
00400968  sw    $s0,40($sp)
00400970  addu  $s8,$zero,$sp
00400978  addu  $s0,$zero,$a0
00400980  beq   $s0,$zero,004009a8
(x) 00400988  lw  $a0,4($s0)      # miss
00400990  jal   00400950 <K_TreeAdd>
(y) 00400998  lw  $a0,8($s0)
004009a0  jal   00400950 <K_TreeAdd>
004009a8  addu  $sp,$zero,$s8     # restore registers from the stack
004009b0  lw    $ra,48($sp)
004009b8  lw    $s8,44($sp)
004009c0  lw    $s0,40($sp)
43
Restore PFE State
  • Correct resume PC
  • Statically construct the resume PC table

Recurrent Load PC    Resume PC
400988               400998
400998               4009a8
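A small C sketch of consulting this table; the lookup helper is illustrative, while the two entries are exactly the ones shown above:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t load_pc, resume_pc; } resume_entry_t;

    /* Statically constructed resume-PC table (from the slide). */
    static const resume_entry_t resume_table[] = {
        { 0x400988, 0x400998 },
        { 0x400998, 0x4009a8 },
    };

    uint32_t resume_pc_for(uint32_t load_pc)
    {
        for (size_t i = 0; i < sizeof resume_table / sizeof resume_table[0]; i++)
            if (resume_table[i].load_pc == load_pc)
                return resume_table[i].resume_pc;
        return 0;  /* not a recurrent load */
    }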