Title: Representation and Dynamic Prefetching of Linked Data Structures
Mark Whitney, John Kubiatowicz
ROC retreat, January 2002
Example linked list and its traversal
Problems and Needed Improvements
- Prefetches are accurate, but data often arrives late.
  - Could use further monitoring of pipeline depth to ensure data arrives earlier.
- Representation generation works well on linked lists but not on trees and arrays.
  - Compression needs refinement to handle these kinds of structures.
Problem
Prefetching of linked data structures (e.g. linked lists, binary trees) is difficult because elements occupy non-contiguous portions of memory.
Solution
- Build a representation of a linked data structure based on interdependencies between memory accesses to the data structure.
- Replay the representation to produce prefetch operations.
    struct node_t {
        int datum1;
        char *datum2;
        struct node_t *next;
    };
    struct node_t *node;
    . . .
    for (node = head; node != NULL; node = node->next)
        if (key1 == node->datum1 && !strcmp(key2, node->datum2))
            break;
Step 4. Begin prefetching by replaying the representation fragment.
    prefetch mem location 0x4000
    prefetch mem location 0x4004
    prefetch mem location 0x4008
    prefetch mem location 0x4100
    . . .
    rule1 -> 3 rule1
    offset8 lds back3
    offset0 lds back1
    offset4 lds back2
    starting ptr 4000
Step 5. As prefetch results return from the compute processor's memory system, initiate new prefetches as dictated by the representation.
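The replay described in Steps 4 and 5 can be sketched in C as follows. This is only an illustration under stated assumptions, not the poster's hardware design: it reads "offsetK lds backN" as "add K to the value produced N loads earlier", and the names `rep_entry`, `load_fn`, `replay_fragment`, and `toy_load`, as well as the toy memory contents, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a representation fragment: "offsetK lds backN". */
struct rep_entry { int32_t offset; int back; };

/* Hypothetical memory interface: returns the word stored at addr. */
typedef uint32_t (*load_fn)(uint32_t addr);

/* Toy memory for illustration only: two list nodes at 0x4000 and 0x4100.
 * The contents are made up, except that datum1 == 7 echoes the poster. */
uint32_t toy_load(uint32_t addr)
{
    switch (addr) {
    case 0x4000: return 7;       /* datum1 */
    case 0x4004: return 0x4580;  /* datum2 (pointer to a string) */
    case 0x4008: return 0x4100;  /* next */
    case 0x4100: return 9;
    case 0x4104: return 0x4600;
    case 0x4108: return 0;       /* next == NULL */
    default:     return 0;
    }
}

/* Replays a fragment starting at entry `first`, with start_ptr standing in
 * for the value produced by the load just before it. Each returned value
 * is recorded so that later entries can use it as a base address.
 * Writes up to max_out prefetch addresses to out and returns the count. */
size_t replay_fragment(const struct rep_entry *rule, size_t rule_len,
                       size_t first, uint32_t start_ptr,
                       load_fn load, uint32_t *out, size_t max_out)
{
    enum { HIST = 64 };          /* assumed bound on tracked values */
    uint32_t hist[HIST];
    size_t count = 0, n = 0;

    assert(rule_len > 0 && first < rule_len);
    hist[count++] = start_ptr;

    while (n < max_out && count < HIST) {
        const struct rep_entry *e = &rule[(first + n) % rule_len];
        if (e->back < 1 || (size_t)e->back > count)
            break;               /* dependence reaches before the start */
        uint32_t base = hist[count - (size_t)e->back];
        if (base == 0)
            break;               /* NULL pointer: end of the structure */
        uint32_t addr = base + (uint32_t)e->offset;
        out[n++] = addr;         /* issue the prefetch */
        hist[count++] = load(addr);  /* result drives later prefetches */
    }
    return n;
}
```

With the fragment {offset8 back3, offset0 back1, offset4 back2}, replay starting at the second entry, and a starting pointer of 0x4000, this sketch emits 0x4000, 0x4004, 0x4008, then moves to the next node at 0x4100 and stops once the next pointer is NULL.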
- Address accesses: 1000, 1004, 1008, 2500, 2504, 2508, 3000
- No structure is obvious from a simple list of memory locations
prefetch result: 7
prefetch result: marvin
prefetch 0x1000
prefetch 0x1004
Loop in C
    for (node = head; node != NULL; node = node->next)
        if (key1 == node->datum1 && !strcmp(key2, node->datum2))
            break;
    rule1 -> 3 rule1
    offset8 lds back3
    offset0 lds back1
    offset4 lds back2
    starting ptr 4000
    0x118 lw $16,8($16)
    0x100 lw $10,0($16)
Compiled loop
    loop:    lw   $10,0($16)
             lw   $4,4($16)
             bne  $10,$6,endif
             jal  strcmp
             bne  $7,$0,endif
             j    outloop
    endif:   lw   $16,8($16)
             bne  $16,$0,loop
    outloop:
Step 3. When the load instruction at the start of a fragment is executed again, fetch the fragment and send it to the prefetch processor.
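One way to picture the lookup in Step 3 is a small table of cached fragments indexed by the address of the load that starts each fragment. The sketch below is an assumption for illustration: the poster does not specify the cache organization, and the direct-mapped layout, the sizes, and the names `rep_cache`, `cache_insert`, and `cache_lookup` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS  16   /* assumed number of slots */
#define MAX_FRAGMENT 8    /* assumed maximum entries per fragment */

struct rep_entry { int32_t offset; int back; };

struct fragment {
    int      valid;
    uint32_t start_pc;                    /* tag: load at fragment start */
    unsigned repeats;                     /* e.g. the 3 in "3 rule1" */
    size_t   len;
    struct rep_entry entries[MAX_FRAGMENT];
};

struct rep_cache { struct fragment slots[CACHE_SLOTS]; };

/* Instructions are word aligned, so drop the low two bits before indexing. */
static size_t slot_of(uint32_t pc) { return (pc >> 2) % CACHE_SLOTS; }

void cache_init(struct rep_cache *c) { memset(c, 0, sizeof *c); }

/* Stores a fragment, overwriting whatever occupied its slot. */
void cache_insert(struct rep_cache *c, uint32_t start_pc, unsigned repeats,
                  const struct rep_entry *entries, size_t len)
{
    assert(len <= MAX_FRAGMENT);
    struct fragment *f = &c->slots[slot_of(start_pc)];
    f->valid = 1;
    f->start_pc = start_pc;
    f->repeats = repeats;
    f->len = len;
    memcpy(f->entries, entries, len * sizeof entries[0]);
}

/* Called when a load executes: returns the cached fragment that starts at
 * this load, or NULL if there is none. A hit is what would be sent on to
 * the prefetch processor. */
const struct fragment *cache_lookup(const struct rep_cache *c, uint32_t pc)
{
    const struct fragment *f = &c->slots[slot_of(pc)];
    return (f->valid && f->start_pc == pc) ? f : NULL;
}
```

In this sketch, inserting the three-entry fragment under 0x118 makes a later execution of the load at 0x118 hit, while loads at other addresses miss.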
Load instructions executed on compute processor
    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    0x118 lw $16,8($16) produces 4000
Example of representation building algorithm
    0x118 lw $16,8($16) accesses 800, produces 1000
    register                        $16
    producing instruction address   0x118
    producing instruction address   0x118
    value produced                  1000

Loop from health benchmark in Olden suite
    while (list != NULL) {
        prefcount++;
        i = village->hosp.free_personnel;
        p = list->patient;              /* This is a bad load */
        if (i > 0) {
            village->hosp.free_personnel--;
            p->time_left = 3;
            p->time += p->time_left;
            removeList(&(village->hosp.waiting), p);
            addList(&(village->hosp.assess), p);
        } else {
            p->time++;                  /* so is this */
        }
        list = list->forward;
    }

Dynamic memory accesses

Representation generated
    rule1 -> 10 rule1
    offset8 lds back3
    offset0 lds back1
    offset4 lds back1
    0x100 lw $10,0($16) accesses 1000, produces 1
    0x104 lw $4,4($16) accesses 1004, produces 4580
    0x118 lw $16,8($16)
    0x100 lw $10,0($16)
    0x104 lw $4,4($16)
    0x118 lw $16,8($16)
    0x100 lw $10,0($16)
    0x104 lw $4,4($16)
    register                        $16     $10
    producing instruction address   0x118   0x100
    producing instruction address   0x118   0x100
    value produced                  1000    1
offset8
offset0
offset4
offset8
    register                        $16     $10     $4
    producing instruction address   0x118   0x100   0x104
    producing instruction address   0x118   0x100   0x104
    value produced                  1000    1       4580

Representation cache
    0x118
    rule1 -> 3 rule1
    offset8 lds back3
    offset0 lds back1
    offset4 lds back2
. . .
Step 2. Compress the chain of dependent loads and pick out highly compressible fragments to cache.
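As a rough illustration of Step 2, the sketch below looks for the shortest fragment that, when repeated, reproduces a chain of (offset, back) entries. This smallest-period search is a simplification chosen for the example and is not claimed to be the poster's compression scheme; the names `rep_entry` and `find_fragment` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the dependent-load chain: "offsetK lds backN". */
struct rep_entry { int32_t offset; int back; };

static int same_entry(const struct rep_entry *a, const struct rep_entry *b)
{
    return a->offset == b->offset && a->back == b->back;
}

/* Returns the length of the shortest fragment whose repetition reproduces
 * chain[0..n) exactly, and stores the number of repetitions in *repeats.
 * If the chain does not repeat, returns n with *repeats == 1. */
size_t find_fragment(const struct rep_entry *chain, size_t n, size_t *repeats)
{
    assert(chain != NULL && repeats != NULL);

    for (size_t period = 1; period <= n / 2; period++) {
        if (n % period != 0)
            continue;            /* only whole repetitions count here */
        size_t i = period;
        while (i < n && same_entry(&chain[i], &chain[i - period]))
            i++;
        if (i == n) {            /* every entry matched one period back */
            *repeats = n / period;
            return period;
        }
    }
    *repeats = 1;
    return n;
}
```

A chain of nine entries made of {offset8 back3, offset0 back1, offset4 back2} three times over yields a fragment of length 3 with 3 repetitions. A high repetition count relative to fragment length is one possible reading of "highly compressible".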
Step 1. Extract dependent loads from the instruction stream on the compute processor.
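The extraction in Step 1 can be sketched with a per-register table that remembers which load last produced each register. The code below is a minimal software illustration under assumptions: the trace format `load_event`, the function `extract_dependent_loads`, and the convention that back == 0 means "no producing load seen yet" are hypothetical and not taken from the poster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 32

/* One dynamic load observed on the compute processor (assumed format). */
struct load_event {
    uint32_t pc;      /* address of the load instruction */
    int      dest;    /* destination register */
    int      base;    /* base register used to form the address */
    int32_t  offset;  /* immediate offset */
};

/* One representation entry: "offsetK lds backN". */
struct rep_entry { int32_t offset; int back; };

/* For each load in the trace, records its offset and how many loads back
 * its base register was produced. out must have room for n entries. */
void extract_dependent_loads(const struct load_event *trace, size_t n,
                             struct rep_entry *out)
{
    /* 1-based index of the load that last wrote each register; 0 = none. */
    size_t producer[NUM_REGS] = {0};

    for (size_t i = 0; i < n; i++) {
        const struct load_event *ld = &trace[i];
        assert(ld->dest >= 0 && ld->dest < NUM_REGS);
        assert(ld->base >= 0 && ld->base < NUM_REGS);

        size_t p = producer[ld->base];
        out[i].offset = ld->offset;
        out[i].back = p ? (int)(i + 1 - p) : 0;

        producer[ld->dest] = i + 1;   /* this load now produces dest */
    }
}
```

Fed the poster's load sequence (0x118 lw $16,8($16); 0x100 lw $10,0($16); 0x104 lw $4,4($16); repeated), the sketch yields offset0 back1, offset4 back2, and offset8 back3 once the first pointer-producing load has been seen.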