Title: EECS 583 Class 20 Compiler Directed Prefetching
1EECS 583 Class 20: Compiler Directed Prefetching
- University of Michigan
- March 28, 2005
- Guest Speaker: Manjunath Kudlur
2Announcements
- Scott back for Wednesday's class
- Register allocation
- Review for exam
- Exam next Monday in class (April 4)
- SIG meetings
- We'll definitely have them this week on Thurs or Fri
- Figure out schedule on Wed in class
- Sorry about last week: too much stuff got pushed to Fri
3Introduction
- Applications spend most of their time waiting for memory
- Prefetching reduces stalls by overlapping data fetch with computation
- Why compiler based prefetching?
- Compiler has global view of data access patterns
- Can perform sophisticated and aggressive optimisations
4Some Preliminaries
- Hardware support
- Non-faulting prefetch instruction in the ISA
- Lockup-free cache: multiple outstanding misses
- Characteristics of compiler based prefetching algorithms
- Accuracy: whether missing loads and missing addresses are correctly predicted
- Timeliness
- Prefetching too early would pollute cache
- Prefetching too late is useless
- Instruction bandwidth: instructions to compute prefetch address and the prefetch instruction itself
5In This Talk..
- Design and Evaluation of a Compiler Algorithm for Prefetching, Todd C. Mowry, Monica S. Lam and Anoop Gupta, ASPLOS V, 1992
- Automatic Compiler-Inserted Prefetching for Pointer Based Applications, C.K. Luk and Todd C. Mowry, IEEE Transactions on Computers, 1999
- Efficient Discovery of Regular Stride Patterns in Irregular Programs and its Use in Compiler Prefetching, Youfeng Wu, PLDI, 2002
- Compiler Orchestrated Prefetching via Speculation and Predication, Rodric Rabbah et al., ASPLOS 2002
- Generating Cache Hints for improved program efficiency, Kristof Beyls and Erik H. D'Hollander, Journal of Systems Architecture, 2004
6A Prefetching Algorithm for Array Based Programs
- Handles programs where array accesses are affine functions of iteration vector
    for (i = 0; i < 100; i++)
        for (j = 0; j < 100; j++)
            X[i][j-1] = ..
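- I = (i, j)^T is the iteration vector
- The index of X[i][j-1] is an affine function of I:
$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ -1 \end{pmatrix} = \begin{pmatrix} i \\ j-1 \end{pmatrix}$$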
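As a small illustration of what "affine function of the iteration vector" means, the sketch below evaluates A*I + c for a two-deep loop nest. The helper name affine_index and its parameters are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes the 2-D array index A*I + c
 * for an iteration vector I = (i, j). */
void affine_index(int A[2][2], const int c[2], const int I[2], int out[2])
{
    assert(out != NULL);
    for (size_t r = 0; r < 2; r++)
        out[r] = A[r][0] * I[0] + A[r][1] * I[1] + c[r];
}
```

For X[i][j-1], A is the identity matrix and c = (0, -1), so iteration (3, 5) touches X[3][4].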
7Locality Analysis
- Find out which references reuse the same cache line
- Tells us that the other references have to be prefetched
- Reuse computed in terms of iteration distances
    for (i = 0; i < 32; i++)
        for (j = 0; j < 32; j++)
            .. X[i+j] ..
- Reuse points for X[9] shown here (figure: iteration space with axes i and j)
- Result from Linear Algebra
- The reuse points for X[AI + C] lie in the Null Space of the matrix A
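A brute-force sketch of the reuse points, assuming the reference in the loop above is X[i + j] (so A = [1 1] and C = 0). The function name reuse_points is hypothetical. A compiler would derive these points symbolically from the null space of A instead of enumerating them.

```c
#include <assert.h>
#include <stddef.h>

/* Collects the iterations (i, j) of a 32x32 loop nest whose access
 * X[i + j] touches element `target`. Returns the number of points
 * written to pts (at most max_pts). */
size_t reuse_points(int target, int pts[][2], size_t max_pts)
{
    assert(pts != NULL);
    size_t n = 0;
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            if (i + j == target && n < max_pts) {
                pts[n][0] = i;
                pts[n][1] = j;
                n++;
            }
    return n;
}
```

For target 9 the points are (0, 9), (1, 8), and so on up to (9, 0). Consecutive points differ by (1, -1), which is a null-space vector of A = [1 1].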
8(Overly Simplistic) Example
Original Code
    for (i = 0; i < 100; i++)
        .. B[i]
After loop splitting and prefetch insertion
    for (i = 0; i < 1; i++)
        prefetch(&B[i]);
    for (i = 0; i < 100; i += 2) {
        prefetch(&B[i+2]);
        .. B[i]
        .. B[i+1]
    }
- Cache line size: 8 bytes (2 elements of B)
- Software pipeline: prefetching the next cache line while working on the current one
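A runnable sketch of the same transformation, assuming GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` stands in for the slide's prefetch(). The function name sum_with_prefetch and the added epilogue loop are illustrative choices, not part of the slide. The epilogue keeps the last prefetch inside the array.

```c
#include <assert.h>
#include <stddef.h>

#define N 100
#define LINE_ELEMS 2   /* elements of B per cache line, as on the slide */

/* Sums B[0..N-1] while prefetching one cache line ahead. */
long sum_with_prefetch(const int *B)
{
    assert(B != NULL);
    long sum = 0;
    size_t i = 0;

    /* Prologue: prefetch the first cache line. */
    __builtin_prefetch(&B[0], 0, 3);

    /* Steady state: work on one line while prefetching the next.
     * The body is unrolled by hand for LINE_ELEMS == 2. */
    for (; i + LINE_ELEMS < N; i += LINE_ELEMS) {
        __builtin_prefetch(&B[i + LINE_ELEMS], 0, 3);
        sum += B[i];
        sum += B[i + 1];
    }

    /* Epilogue: the last line has nothing left to prefetch. */
    for (; i < N; i++)
        sum += B[i];
    return sum;
}
```

A real compiler would also pick the prefetch distance from the memory latency and the loop body length. Here the distance is fixed at one cache line, as in the slide.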
9A Prefetching Algorithm for Pointer Based Programs
- Challenges in prefetching for pointer based programs
- Recursive data structures (RDSs)
- Trees, linked lists
- Many SPEC and real world programs use RDSs extensively
10Greedy Prefetching
- Prefetch all child nodes of the current node
- Advantages
- Straightforward to implement in a compiler
- Very little overhead
- Disadvantages
- Prefetching just one level of children may not
be enough
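A minimal sketch of greedy prefetching on a binary tree, assuming GCC or Clang's `__builtin_prefetch`. The struct node layout and the function tree_sum are made up for this example.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *left, *right;
};

/* Preorder sum that greedily prefetches both children of the
 * current node before visiting either of them. */
long tree_sum(const struct node *n)
{
    if (n == NULL)
        return 0;
    /* The prefetch is non-faulting, so NULL children are harmless. */
    __builtin_prefetch(n->left, 0, 3);
    __builtin_prefetch(n->right, 0, 3);
    return n->value + tree_sum(n->left) + tree_sum(n->right);
}
```

The right child has the whole left subtree traversal to arrive, but the left child is visited immediately. This is the "one level may not be enough" disadvantage.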
11Pointer Chasing Problem
- Have to prefetch a node far away to hide latency
- Have to dereference multiple levels of pointers to get to far away node
- Pointer chasing leads to loads, which themselves cause misses, which defeats the purpose
- History Pointer Prefetching
12History Pointer Prefetching
- Maintain pointers to far away nodes in current node
- The history pointers are filled during first traversal
- Data prefetched using history pointers during later traversals
- Disadvantages
- Extra space/computation overhead for history pointers
- First traversal not benefited
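One way the idea could look for a linked list, assuming GCC or Clang's `__builtin_prefetch`. The struct layout, the distance HISTORY_DIST and the function traverse are assumptions made for this sketch. Each traversal prefetches through the history pointers recorded last time and refreshes them for the next.

```c
#include <stddef.h>

#define HISTORY_DIST 3   /* assumed distance a history pointer reaches ahead */

struct item {
    int value;
    struct item *next;
    struct item *history;   /* node HISTORY_DIST hops ahead, or NULL */
};

long traverse(struct item *head)
{
    long sum = 0;
    struct item *trail[HISTORY_DIST] = {0};  /* ring of recently visited nodes */
    size_t k = 0;

    for (struct item *p = head; p != NULL; p = p->next, k++) {
        /* Later traversals: prefetch the far-away node directly. */
        if (p->history != NULL)
            __builtin_prefetch(p->history, 0, 3);

        /* The node visited HISTORY_DIST steps ago now learns about p. */
        struct item *old = trail[k % HISTORY_DIST];
        if (old != NULL)
            old->history = p;
        trail[k % HISTORY_DIST] = p;

        sum += p->value;
    }
    return sum;
}
```

The extra field and the ring-buffer bookkeeping are the space and computation overhead listed above. The first call issues no prefetches because every history pointer is still NULL.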
13Regular Stride Patterns
- Compiler prefetching at assembly level
- Consider individual load instructions, look for patterns in the address they load
- Artifacts of memory allocation order and access patterns
- 197.parser: loads in S1 and S2 have a constant stride 94% of the time
- 254.gap: load in S2 has 2 dominant strides, 48% and 47% of the time
14Use in Prefetching
- Discover regular strides based on profiling
- When 1 or 2 dominant strides are found for a load instruction, insert prefetch accordingly
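A simple sketch of finding the dominant stride of one load from a profiled address trace. The function name dominant_stride is hypothetical and the quadratic search is for brevity only. A real profiler would use a much cheaper scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the most frequent difference between consecutive addresses
 * in the trace and stores how often it occurred in *count.
 * Returns 0 with *count == 0 when the trace has fewer than 2 entries. */
intptr_t dominant_stride(const uintptr_t *addr, size_t n, size_t *count)
{
    assert(count != NULL);
    intptr_t best = 0;
    *count = 0;
    for (size_t i = 1; i < n; i++) {
        intptr_t s = (intptr_t)(addr[i] - addr[i - 1]);
        size_t c = 0;
        for (size_t j = 1; j < n; j++)
            if ((intptr_t)(addr[j] - addr[j - 1]) == s)
                c++;
        if (c > *count) {
            *count = c;
            best = s;
        }
    }
    return best;
}
```

If the returned stride covers a large share of the n - 1 observed strides, the compiler can insert a prefetch of the load address plus some multiple of that stride.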
15Precomputation Based Prefetching
R1 = list
R5 = 0
loop:
R2 = R1 + 4
R3 = load R2
R4 = R1 + 8
R1 = load R4
R5 = R5 + R3
br loop (R1 != NULL)
// record is a pointer to a structure in memory
record = list;
while (record != NULL) {
    data = record->field;
    record = record->next;
    sum += data;
}
16Precomputation Based Prefetching (Contd..)
R1 = list
R5 = 0
R6 R1 R8
loop:
R2 = R1 + 4
R3 = load R2
R4 = R1 + 8
R1 = load R4
R5 = R5 + R3
br loop (R1 != NULL)
R7 = load R6
R6 = R7 + 8
R8 prefR7
R9 = R8 + 8
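A source-level sketch of the idea, assuming GCC or Clang's `__builtin_prefetch`. It is not a translation of the instruction sequence above. The struct layout and the names sum_records and ahead are illustrative. The loop carries a second copy of the pointer chase that runs one node ahead of the main traversal and prefetches what it finds.

```c
#include <stddef.h>

struct record {
    int field;
    struct record *next;
};

long sum_records(const struct record *list)
{
    long sum = 0;
    /* Run-ahead copy of the pointer chase (the precomputation). */
    const struct record *ahead = (list != NULL) ? list->next : NULL;

    for (const struct record *r = list; r != NULL; r = r->next) {
        if (ahead != NULL) {
            __builtin_prefetch(ahead, 0, 3);
            ahead = ahead->next;   /* this load can itself miss */
        }
        sum += r->field;
    }
    return sum;
}
```

The load of ahead->next is an ordinary load, so the precomputation can stall on a miss just as the main loop does.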
17Informing Memory Operations
- What if the precomputation operations themselves cause cache misses?
- Should abandon prefetch when not profitable
- Informing load operation
- Sets some architectural state to inform the programmer about its state
- E.g., in Itanium 2, iLD sets a predicate register when it's a miss
- Can use iLD to inform later prefetch instructions
18Informing Memory Operations
R1 = list
R5 = 0
R6 R1 R8
loop:
R2 = R1 + 4
R3 = load R2
R4 = R1 + 8
R1 = load R4
R5 = R5 + R3
br loop (R1 != NULL)
R7 = iLD R6
R6 = R7 + 8 if p1
R8 prefR7 if p1
R9 = R8 + 8 if p1
19Cache Hints
- Annotations to regular memory instructions
- Source cache specifier
- Indicates at which cache level the data is likely to be found
- Used by the compiler to estimate the access latency
- Target cache specifier
- Indicates at which cache level the data is kept after execution
- Used by hardware to make decisions on data placement
20Cache Hints on Itanium
- Target cache hints specify whether there is temporal locality at a given cache level
- t1, nt1, nt2, nta
21Static Cache Hint Selection
- Reuse distance: number of unique memory elements accessed between two accesses to the same element
- Property 1: An access a will hit in a fully associative LRU cache with n lines iff its backward reuse distance is less than n
- E.g., a6 will be a hit if RD(a1, a6) = 3 < n
- Property 2: Data accessed by access a will remain in the cache until its next use iff its forward reuse distance is less than n
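A direct sketch of the backward reuse distance on a small trace of element ids. The function name backward_reuse_distance and the -1 convention for a first access are assumptions of this example. The quadratic scan is for clarity only.

```c
#include <assert.h>
#include <stddef.h>

/* Backward reuse distance of the access at index pos: the number of
 * distinct elements touched since the previous access to the same
 * element. Returns -1 if the element was never accessed before. */
long backward_reuse_distance(const int *trace, size_t pos)
{
    assert(trace != NULL);
    size_t prev = pos;
    while (prev > 0 && trace[prev - 1] != trace[pos])
        prev--;
    if (prev == 0)
        return -1;            /* first access to this element */
    prev--;                   /* index of the previous access */

    long distinct = 0;
    for (size_t i = prev + 1; i < pos; i++) {
        int seen = 0;
        for (size_t j = prev + 1; j < i; j++)
            if (trace[j] == trace[i]) {
                seen = 1;
                break;
            }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

For the trace 1 2 3 2 4 1, the last access to element 1 has distance 3 (elements 2, 3 and 4 lie in between). By Property 1 it hits in a fully associative LRU cache with more than 3 lines.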
22Static Cache Hint Selection (Contd..)
- Plot a graph of reuse distance vs. number of times that distance was seen
- Pick cache hint accordingly
- Profit!
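A hedged sketch of what "pick cache hint accordingly" could look like. Given a reuse-distance histogram for one memory instruction, it chooses the hint for the smallest cache level that captures most of the reuses. The coverage threshold, the capacity array cap, and the mapping of levels to t1, nt1, nt2 and nta are assumptions made for this example, not the published selection rule.

```c
#include <assert.h>
#include <stddef.h>

enum cache_hint { HINT_T1, HINT_NT1, HINT_NT2, HINT_NTA };

/* hist[d] = number of reuses observed at reuse distance d.
 * cap[k]  = capacity in lines of cache level k+1 (assumed inputs).
 * Returns the hint for the smallest level whose capacity covers at
 * least `coverage` (e.g. 0.9) of all observed reuses. */
enum cache_hint pick_hint(const unsigned long *hist, size_t nbins,
                          const size_t cap[3], double coverage)
{
    static const enum cache_hint by_level[3] = { HINT_T1, HINT_NT1, HINT_NT2 };
    assert(hist != NULL && cap != NULL);

    unsigned long total = 0;
    for (size_t d = 0; d < nbins; d++)
        total += hist[d];

    for (int level = 0; level < 3; level++) {
        unsigned long hits = 0;
        for (size_t d = 0; d < nbins && d < cap[level]; d++)
            hits += hist[d];
        if (total > 0 && (double)hits >= coverage * (double)total)
            return by_level[level];
    }
    return HINT_NTA;   /* no level captures enough of the reuse */
}
```

With capacities of 2, 4 and 8 lines, an instruction whose reuses all have distance 3 misses the first level but fits the second, so the sketch returns HINT_NT1.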