Title: Static Identification of Delinquent Loads
1Static Identification of Delinquent Loads
- V.M. Panait
- Sasturkar
- W.-F. Fong
2Agenda
- Introduction
- Related Work
- Delinquent Loads
- Framework
- Address Patterns, Decision Criteria
- The heuristic: types of classes, computing the
weights, final classes
- Results
3Introduction
- Cache: one of the major current bottlenecks in
performance
- One approach: prefetch. But prefetch what? Can't
prefetch everything
- Few loads are really bad: delinquent loads
- This paper: classification of address patterns in
the load instructions
4Introduction
- Done after code generation, but before runtime
- Singled out 10% of all loads causing over 90% of
the misses in 18 SPEC benchmarks
- Gets even better combined with basic block
profiling: 1.3% of loads covering over 80% of the
misses
5Related Work
- BDH method: classify loads based on the following
criteria
- Region of memory accessed by the load: S (stack),
H (heap) or G (global)
- Kind of reference: loading a scalar (S), element
of array (A) or field of a structure (F)
- Type of reference: (P)ointer or (N)ot
6Related Work
- Some classes account for most misses: GAN, HSN,
HFN, HAN, HFP, HAP
- The OKN method: 3 simple heuristics
- Use of a pointer dereference
- Use of a strided reference
- None of the above
- This paper is much more precise than both methods
above
7Delinquent Loads
- Why not stores too? Write buffers are apparently
good enough
- Why not do it in hardware? They do, but:
- Need additional specialized hardware
- Complex decisions (fast) <-> complex hardware
- Memory profiling: not always practical
8Delinquent Loads Profiling
9Framework
- Assembly code -> address patterns for each load
instruction -> placement of the load instruction
in a class
- Classes and weights -> heuristic function
- If the value of the heuristic is greater than a
delinquency threshold, the instruction is
classified as possibly delinquent
10Address Patterns
- Address pattern: summary of how the source
address of the load instruction is computed
- Uses CFG and DF analysis (reaching definitions),
with one address pattern for each control path
reaching the load
- Only uses basic registers (BR): gp, sp, regparam,
regret
11The Decision Criteria
- Classes are derived from these criteria
- H1: Register usage in an address pattern (usage
of BRs)
- H2: Type of operations used in address
computation (arithmetic, logic)
- H3: Maximum level of dereferencing
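To make criteria H1 to H3 concrete, here is a minimal sketch of how they could be read off an address pattern. The nested-tuple encoding of a pattern, the operator names (`add`, `mul`, `shl`, `shr`, `deref`) and all function names are assumptions made for illustration; the slides do not say how the authors represent patterns.

```python
from collections import Counter

# The four basic registers named on the Address Patterns slide.
BASIC_REGISTERS = ("gp", "sp", "regparam", "regret")

# Hypothetical encoding of an address pattern as nested tuples:
#   ("reg", name), ("const", value), ("add", a, b), ("mul", a, b),
#   ("shl", a, b), ("shr", a, b), ("deref", a)

def walk(node):
    """Yield every tuple node of the pattern, root first."""
    yield node
    for child in node[1:]:
        if isinstance(child, tuple):
            yield from walk(child)

def basic_register_counts(pattern):
    """H1: how many times each basic register occurs in the pattern."""
    return Counter(n[1] for n in walk(pattern)
                   if n[0] == "reg" and n[1] in BASIC_REGISTERS)

def uses_mul_or_shift(pattern):
    """H2: does the address computation multiply or shift?"""
    return any(n[0] in ("mul", "shl", "shr") for n in walk(pattern))

def max_deref_level(node):
    """H3: deepest nesting of dereferences in the address computation."""
    if not isinstance(node, tuple):
        return 0
    deepest = max((max_deref_level(c) for c in node[1:]), default=0)
    return deepest + (1 if node[0] == "deref" else 0)

# Address of a[i] where both the base pointer and i live on the stack:
#   *(sp + 16) + (*(sp + 20) << 2)
pattern = ("add",
           ("deref", ("add", ("reg", "sp"), ("const", 16))),
           ("shl", ("deref", ("add", ("reg", "sp"), ("const", 20))),
                   ("const", 2)))
```

For this pattern, sp occurs twice, a shift is used, and the address computation contains one level of dereferencing.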
12The Decision Criteria
- H4: Recurrence (iterative walk through memory)
- H5: Execution frequency, based on BB profiling;
classifies loads as:
- Rarely executed (used here as negative)
- Seldom executed (idem)
- Fairly often executed (not used here)
- In a program hotspot
13Decision Criteria and Classes
- Each criterion results in a set of classes
- Class: set of address patterns with a certain
property
- Too many classes can result; only some are
considered, and some of those are also
aggregated into one class
14Decision Criteria and Classes
- H1-based classes: enumerations of the number of
occurrences of each of the 4 BRs in an address
pattern
- H2-based classes: address patterns with
multiplications and shift operations
- H3-based classes: as many as there are levels
of dereferencing in the address patterns
15Decision Criteria and Classes
- H4-based classes: two classes (address pattern
involves recurrence or not)
- H5-based classes: three classes (rarely, seldom
and program hotspot)
16Experimental Setup
- SimpleScalar toolkit: cache simulator (for cache
hits/misses), compiler, objdump
- Procedure: Fortran -> C code (via f2c) -> MIPS
executable (via C2MIPS compiler) -> disassembled
code (via objdump)
- Reconstruction of CFG and DF analysis
17Experimental Setup
- 2 stages: learning/training and experimental
(actual)
- Stage 1: get full memory profiling data on a
subset of SPEC benchmarks, use it to compute
weights for each class
- Stage 2: use the heuristic thus obtained on a new
subset of benchmarks
18The Heuristic Types of Classes
- Three types of classes
- Positive (loads in it are likely delinquent)
- Negative (loads in it are likely not delinquent)
- Neutral
- Positive classes have positive weights, negative
ones have negative weights, neutral classes have
a weight of zero
19The Heuristic Terminology
- The miss probability of class F in benchmark j
- The amount of misses accounted for by members of
class F in benchmark j
20The Heuristic Terminology
- mj(F,C): likelihood of an instruction of class F
in benchmark j to be a cache miss
- However, if that instruction is only executed
once, it won't be a delinquent load
- nj(F,C): proportion out of total number of
misses that members of F account for
21The Heuristic Terminology
- Strength index: r = mj / nj
- A benchmark j is irrelevant to a class F if both
indices mj and nj are below certain thresholds.
Otherwise it is relevant.
- Positive class: r > 5 for all benchmarks
- Negative class: nj < 0.5 for all benchmarks
- Neutral class: r < 5 for at least 1 benchmark
22Computing the Weights
- Form classes according to the five decision
criteria
- Compute mj, nj for each class
- Weight of class Fk
23Computing the Weights
- This is the formula for positive classes only
- Only relevant benchmarks are included in the
formula
- |.| is the cardinality of that set, i.e. the
number of benchmarks relevant to that class
24Aggregate Classes
- AG1: both gp and sp are used, once each (comes
from H1)
- AG2: only sp is used, twice (H1)
- AG3: either multiplications or shifts are used (H2)
- AG4: one-level dereferencing (H3)
- AG5: two-level dereferencing (H3)
- AG6: three-level dereferencing (H3)
25Aggregate Classes
- AG7: address patterns containing a recurrence
(H4)
- AG8: loads with low frequency of execution
(100 < f < 1000) (H5)
- AG9: loads with fairly low frequency of execution
(f < 100 times) (H5)
- Weight formula for negative classes: negated mean
of positive weights
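The aggregate classes AG1 to AG9 can be read as simple predicates over the per-load features. The sketch below is one possible reading, not the authors' code: the function name, its parameters, and the interpretation of AG1 and AG2 as "no other basic register appears" are assumptions.

```python
from collections import Counter

def aggregate_classes(br_counts, mul_or_shift, deref_level,
                      has_recurrence, exec_count=None):
    """Return the set of aggregate classes (AG1..AG9) a load falls into.

    br_counts      -- Counter of basic-register occurrences (H1)
    mul_or_shift   -- True if the pattern multiplies or shifts (H2)
    deref_level    -- maximum level of dereferencing (H3)
    has_recurrence -- True if the pattern contains a recurrence (H4)
    exec_count     -- execution count f from BB profiling, or None (H5)
    """
    classes = set()
    used = {reg for reg, n in br_counts.items() if n > 0}

    # Assumption: AG1/AG2 require that no other basic register is used.
    if used == {"gp", "sp"} and br_counts["gp"] == 1 and br_counts["sp"] == 1:
        classes.add("AG1")
    if used == {"sp"} and br_counts["sp"] == 2:
        classes.add("AG2")
    if mul_or_shift:
        classes.add("AG3")
    if deref_level in (1, 2, 3):
        classes.add(f"AG{3 + deref_level}")   # AG4, AG5, AG6
    if has_recurrence:
        classes.add("AG7")
    if exec_count is not None:
        # Strict inequalities as written on the slide; f == 100 matches neither.
        if 100 < exec_count < 1000:
            classes.add("AG8")
        elif exec_count < 100:
            classes.add("AG9")
    return classes
```

For example, a load whose pattern uses sp twice, contains a shift and one dereference, and executes 50 times falls into AG2, AG3, AG4 and AG9.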
26The Heuristic Function
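- 1 if the heuristic value of the load is greater
than the delinquency threshold
- 0 otherwise
- 1 means the load is classified as possibly
delinquent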
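A minimal sketch of the decision step follows. It assumes the heuristic value of a load is the sum of the weights of the classes it belongs to; the formula itself did not survive on this slide, so the summation, the function names and the weight values are illustrative assumptions only.

```python
def heuristic_score(load_classes, weights):
    """Sum the weights of the classes a load belongs to.

    Classes missing from `weights` count as neutral (weight 0).
    Summation is an assumption; the slide's formula is not available.
    """
    return sum(weights.get(cls, 0.0) for cls in load_classes)

def is_possibly_delinquent(load_classes, weights, threshold):
    """Return 1 if the score exceeds the delinquency threshold, else 0."""
    return 1 if heuristic_score(load_classes, weights) > threshold else 0

# Made-up weights for illustration: positive classes carry positive
# weights, negative classes carry negative weights.
example_weights = {"AG2": 2.0, "AG4": 1.5, "AG9": -3.0}
```

With these made-up weights and a threshold of 3.0, a load in AG2 and AG4 scores 3.5 and is flagged; if it is also in AG9 (rarely executed), the score drops to 0.5 and it is not.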
27Precision and Coverage
- Precision of a heuristic scheme H: the number of
loads that scheme H identifies as delinquent, as a
percentage of all loads (the lower, i.e., closer
to the real one, the better)
- Coverage of a heuristic scheme H: the percentage
of cache misses caused by loads identified
as delinquent by scheme H (the closer to 100%,
the better)
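Both metrics can be computed from a per-load miss count and the set of flagged loads. The sketch below assumes precision is reported as a percentage of all static loads, which is how the figure of 10 in the conclusions reads; the function and variable names are hypothetical.

```python
def precision_and_coverage(misses_per_load, flagged):
    """Return (precision, coverage) as percentages.

    misses_per_load -- dict mapping every static load to its miss count
    flagged         -- set of loads the scheme identifies as delinquent
    Assumes at least one load and at least one miss overall.
    """
    total_loads = len(misses_per_load)
    total_misses = sum(misses_per_load.values())
    precision = 100.0 * len(flagged) / total_loads
    covered = sum(misses_per_load[load] for load in flagged)
    coverage = 100.0 * covered / total_misses
    return precision, coverage

# Ten loads, one of which causes 90 of the 100 misses.
example_misses = {"l1": 90, "l2": 6, "l3": 4, "l4": 0, "l5": 0,
                  "l6": 0, "l7": 0, "l8": 0, "l9": 0, "l10": 0}
```

Flagging only `l1` in this toy profile gives a precision of 10 and a coverage of 90, the shape of result the deck reports.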
28Results on different inputs
29Results when varying cache associativity
30Results when varying cache size
31Performance on new benchmarks
32Performance summary
33Performance of OKN BDH
34Performance with various ?
35Combination with BB profiling
- Use the heuristic to sharpen the set returned by
BB profiling
- Also add loads that are not in the hotspots
- ? is the percentage of the highest scoring loads
detected by our method but not by profiling that
we consider to be delinquent
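One way to read the combination is sketched below: keep the hotspot loads that the heuristic also flags, then add the top-scoring share of flagged loads that lie outside the hotspots. Treating "sharpen" as a set intersection, the rounding up, and every name here (including `extra_pct` for the percentage parameter) are assumptions, not the authors' stated procedure.

```python
import math

def combine_with_profiling(scores, hotspot_loads, threshold, extra_pct):
    """Combine static heuristic scores with basic-block profiling.

    scores        -- dict mapping each load to its heuristic score
    hotspot_loads -- set of loads that BB profiling places in hotspots
    threshold     -- delinquency threshold on the heuristic score
    extra_pct     -- percentage of the highest-scoring flagged loads
                     outside the hotspots to add back in
    """
    flagged = {load for load, s in scores.items() if s > threshold}
    # Sharpen the profiling set: keep only hotspot loads the heuristic flags.
    sharpened = flagged & hotspot_loads
    # Rank flagged loads that profiling missed, best score first.
    outside = sorted(flagged - hotspot_loads,
                     key=lambda load: scores[load], reverse=True)
    keep = math.ceil(len(outside) * extra_pct / 100)
    return sharpened | set(outside[:keep])

example_scores = {"a": 9, "b": 7, "c": 6, "d": 1, "e": 8}
```

With `hotspot_loads = {"a", "d"}`, a threshold of 5 and `extra_pct = 50`, the result is `a` (flagged and in a hotspot) plus `e` and `b` (the two best-scoring flagged loads outside the hotspots).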
36Combination with BB profiling
37Conclusions
- The static scheme for identifying delinquent
loads has a precision of 10% and coverage of over
90% over 18 benchmarks
- More precise than related work, similar coverage
- Immune to variation of framework parameters (e.g.
cache size, associativity, input)