The SGI Pro64 Compiler Infrastructure - PowerPoint PPT Presentation
1
The SGI Pro64 Compiler Infrastructure
- A Tutorial
  • Guang R. Gao (U of Delaware) J. Dehnert (SGI)
  • J. N. Amaral (U of Alberta) R. Towle (SGI)

2
Acknowledgement
  • The SGI Compiler Development Teams
  • The MIPSpro/Pro64 Development Team
  • University of Delaware
  • CAPSL Compiler Team
  • These individuals contributed directly to this
    tutorial
  • A. Douillet (Udel) F. Chow (Equator)
  • S. Chan (Intel) W. Ho (Routefree)
  • Z. Hu (Udel) K. Lesniak (SGI)
  • S. Liu (HP) R. Lo (Routefree)
  • S. Mantripragada (SGI) C. Murthy (SGI)
  • M. Murphy (SGI) G. Pirocanac (SGI)
  • D. Stephenson (SGI) D. Whitney (SGI)
  • H. Yang (Udel)

3
What is Pro64?
  • A suite of optimizing compiler tools for Linux/
    Intel IA-64 systems
  • C, C++, and Fortran90/95 compilers
  • Conforming to the IA-64 Linux ABI and API
    standards
  • Open to all researchers/developers in the
    community
  • Compatible with HP Native User Environment

4
Who Might Want to Use Pro64?
  • Researchers: test new compiler analysis and
    optimization algorithms
  • Developers: retarget to another
    architecture/system
  • Educators: a compiler teaching platform

5
Outline
  • Background and Motivation
  • Part I: An overview of the SGI Pro64 compiler
    infrastructure
  • Part II: The Pro64 code generator design
  • Part III: Using Pro64 in compiler research and
    development
  • SGI Pro64 support
  • Summary

6
PART I Overview of the Pro64 Compiler
7
Outline
  • Logical compilation model and component flow
  • WHIRL Intermediate Representation
  • Inter-Procedural Analysis (IPA)
  • Loop Nest Optimizer (LNO) and Parallelization
  • Global optimization (WOPT)
  • Feedback
  • Design for debugability and testability

8
Logical Compilation Model
The driver (sgicc/sgif90/sgiCC) forks and execs each phase along the data path:
  front end + IPA (gfec/gfecc/mfef90): src (.c/.C/.f) -> WHIRL (.B/.I)
  back end (be, as): WHIRL -> obj (.o)
  linker (ld): obj -> a.out/.so
9
Components of Pro64
Front end
Interprocedural Analysis and Optimization
Loop Nest Optimization and Parallelization
Global Optimization
Code Generation
10
Data Flow Relationship Between Modules
(diagram, linearized)
  gfec / gfecc / f90 front ends -> Very High WHIRL (.B)
    • lower I/O (only for f90)
    • Inliner -> .I
    • -IPA: Local IPA -> Main IPA
  Lower to High WHIRL -> High WHIRL
    • -O3: LNO
    • WHIRL C -> .w2c.c, .w2c.h
    • WHIRL Fortran -> .w2f.f
  Take either path (-phase woff):
    • -O0: Lower all
    • -O2/O3: Main opt -> Lower Mid WHIRL -> Mid WHIRL
  Low WHIRL -> CG
11
Front Ends
  • C front end based on gcc
  • C++ front end based on g++
  • Fortran90/95 front end from MIPSpro

12
Intermediate Representation
  • IR is called WHIRL
  • Tree structured, with references to symbol table
  • Maps used for local or sparse annotation
  • Common interface between components
  • Multiple languages, multiple targets
  • Same IR, 5 levels of representation
  • Continuous lowering during compilation
  • Optimization strategy tied to level

13
IPA Main Stage
  • Analysis
  • alias analysis
  • array section
  • code layout
  • Optimization (fully integrated)
  • inlining
  • cloning
  • dead function and variable elimination
  • constant propagation

14
IPA Design Features
  • User transparent
  • No makefile changes
  • Handles DSOs, unanalyzed objects
  • Provide info (e.g. alias analysis, procedure
    properties) smoothly to
  • loop nest optimizer
  • main optimizer
  • code generator

15
Loop Nest Optimizer/Parallelizer
  • All languages (including OpenMP)
  • Loop level dependence analysis
  • Uniprocessor loop level transformations
  • Automatic parallelization

16
Loop Level Transformations
  • Based on unified cost model
  • Heuristics integrated with software pipelining
  • Loop vector dependency info passed to CG
  • Loop Fission
  • Loop Fusion
  • Loop Unroll and Jam
  • Loop Interchange
  • Loop Peeling
  • Loop Tiling
  • Vector Data Prefetching

17
Parallelization
  • Automatic
  • Array privatization
  • Doacross parallelization
  • Array section analysis
  • Directive based
  • OpenMP
  • Integrated with automatic methods

18
Global Optimization Phase
  • SSA is unifying technology
  • Use only SSA as program representation
  • All traditional global optimizations implemented
  • Every optimization preserves SSA form
  • Can reapply each optimization as needed

19
Pro64 Extensions to SSA
  • Representing aliases and indirect memory
    operations (Chow et al., CC '96)
  • Integrated partial redundancy elimination (Chow
    et al., PLDI '97; Kennedy et al., CC '98, TOPLAS '99)
  • Support for speculative code motion
  • Register promotion via load and store placement
    (Lo et al., PLDI '98)

20
Feedback
  • Used throughout the compiler
  • Instrumentation can be added at any stage
  • Explicit instrumentation data incorporated where
    inserted
  • Instrumentation data maintained and checked for
    consistency through program transformations.

21
Design for Debugability (DFD) and Testability
(DFT)
  • DFD and DFT built-in from start
  • Can build with extra validity checks
  • Simple option specification used to
  • Substitute components known to be good
  • Enable/disable full components or specific
    optimizations
  • Invoke alternative heuristics
  • Trace individual phases

22
Where to Obtain Pro64 Compiler and its Support
  • SGI source download
  • http://oss.sgi.com/projects/Pro64/
  • University of Delaware Pro64 Support Group
  • http://www.capsl.udel.edu/pro64
  • pro64@capsl.udel.edu

23
Overview of The Pro64 Code Generator
PART II
24
Outline
  • Code generator flow diagram
  • WHIRL/CGIR and TARG-INFO
  • Hyperblock formation and predication (HBF)
  • Predicate Query System (PQS)
  • Loop preparation (CGPREP) and software pipelining
  • Global and local instruction scheduling (IGLS)
  • Global and local register allocation (GRA, LRA)

25
Flowchart of Code Generator
WHIRL
  -> WHIRL-to-TOP Lowering -> CGIR (quad op list)
  -> EBO (extended basic block optimization: peephole, etc.)
  -> Control Flow Opt I, EBO
  -> Hyperblock Formation, Critical-Path Reduction (PQS: Predicate Query System)
  -> Process Inner Loops (unrolling, EBO; loop prep, software pipelining)
  -> Control Flow Opt II, EBO
  -> IGLS pre-pass -> GRA, LRA, EBO -> IGLS post-pass -> Control Flow Opt
  -> Code Emission
26
WHIRL
  • Abstract syntax tree based
  • Symbol table links, map annotations
  • Base representation is simple and efficient
  • Used through several phases with lowering
  • Designed for multiple target architectures

27
From WHIRL to CGIR An Example
(a) Source:
    int *a; int i; int aa;
    aa = a[i];

(b) WHIRL (tree, abbreviated):
    ST aa
      LD ( a + 4 * CVTL32(i) )

(c) CGIR:
    T1 = sp + a
    T2 = ld T1
    T3 = sp + i
    T4 = ld T3
    T5 = sxt T4
    T6 = T5 << 2
    T7 = T6
    T8 = T2 + T7
    T9 = ld T8
    T10 = sp + aa
    st T10, T9
28
Code Generation Intermediate Representation (CGIR)
  • TOPs (Target Operations) are quads
  • Operands/results are TNs
  • Basic block nodes in control flow graph
  • Load/store architecture
  • Supports predication
  • Flags on TOPs (copy ops, integer add, load, etc.)
  • Flags on operands (TNs)

29
From WHIRL to CGIR
Contd
  • Information passed
  • alias information
  • loop information
  • symbol table and maps

30

The Target Information Table (TARG_INFO)
  • Objective
  • Parameterized description of a target machine and
    system architecture
  • Separates architecture details from the
    compiler's algorithms
  • Minimizes compiler changes when targeting a new
    architecture

31
The Target Information Table (TARG_INFO)
Contd
  • Based on an extension of Cydra tables, with major
    improvements
  • Architectures already targeted:
  • Whole MIPS family
  • IA-64
  • IA-32
  • SGI graphics processors (earlier version)

32
Flowchart of Code Generator
(code generator flowchart repeated from slide 25)
33
Hyperblock Formation and Predicated Execution
  • Hyperblock: a single-entry multiple-exit
    control-flow region
  • loop body, hammock region, etc.
  • Hyperblock formation algorithm
  • based on Scott Mahlke's method [Mahlke96]
  • but with less aggressive tail duplication

34
Hyperblock Formation Algorithm
  • Hammock regions
  • Innermost loops
  • General regions (path based)
  • Paths sorted by priorities (freq., size, length,
    etc.)
  • Inclusion of a path is guided by its impact on
    resources, scheduling height, and priority level
  • Internal branches are removed via predication
  • Predicate reuse

Phases: Region Identification -> Block Selection -> Tail Duplication -> If-Conversion

Objective: keep the scheduling height close to that of the highest-priority path.
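To make the if-conversion step concrete, here is a small C sketch (not Pro64 code; the PredOp type and all names are invented for illustration) that simulates how a hammock `if (c) x = a; else x = b;` becomes straight-line predicated operations guarded by a predicate and its complement:

```c
#include <assert.h>

/* A predicated operation: writes src to *dst only when guard is true. */
typedef struct { int guard; int *dst; int src; } PredOp;

static void run_hyperblock(PredOp *ops, int n) {
    for (int i = 0; i < n; i++)
        if (ops[i].guard)          /* op is squashed when its predicate is false */
            *ops[i].dst = ops[i].src;
}

/* if (c) x = a; else x = b;  with the internal branch removed via predication */
static int select_via_predication(int c, int a, int b) {
    int x = 0;
    int p = (c != 0);              /* predicate defined by the compare */
    PredOp hb[] = { { p, &x, a }, { !p, &x, b } };
    run_hyperblock(hb, 2);
    return x;
}
```

Both arms now sit in one scheduling region, which is what lets later phases overlap them.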
35
Hyperblock Formation - An Example
(a) Source:
    aa = a[i]; bb = b[i];
    switch (aa) {
    case 1:  if (aa < tabsiz) aa = tab[aa];
    case 2:  if (bb < tabsiz) bb = tab[bb];
    default: ans = aa + bb;
    }

(b) CFG (figure: blocks 1-8)

(c) Hyperblock formation with aggressive tail duplication (figure: hyperblocks H1, H2; blocks 4-8 duplicated)
36
Hyperblock Formation - An Example
Contd
(figure: (a) CFG with blocks 1-8; (b) hyperblock formation with aggressive tail duplication (H1, H2); (c) Pro64 hyperblock formation (H1, H2, with less duplication))
37
Features of the Pro64 Hyperblock Formation (HBF)
Algorithm
  • Form good vs. maximal hyperblocks
  • Avoid unnecessary duplication
  • No reverse if-conversion
  • Hyperblocks are not a barrier to global code
    motion later in IGLS

38
Predicate Query System (PQS)
  • Purpose: gather information and provide
    interfaces allowing other phases to make queries
    regarding the relationships among predicate
    values
  • PQS functions (examples)
  • BOOL PQSCG_is_disjoint(PQS_TN tn1, PQS_TN tn2)
  • BOOL PQSCG_is_subset(PQS_TN_SET tns1, PQS_TN_SET tns2)

39
Flowchart of Code Generator
(code generator flowchart repeated from slide 25)
40
Loop Preparation and Optimization for Software
Pipelining
  • Loop canonicalization for SWP
  • Read/Write removal (register aware)
  • Loop unrolling (resource aware)
  • Recurrence removal or extension
  • Prefetch
  • Forced if-conversion

41
Pro64 Software Pipelining Method Overview
  • Test for SWP-amenable loops
  • Extensive loop preparation and optimization
    before application [DeTo93]
  • Use a lifetime-sensitive SWP algorithm [Huff93]
  • Register allocation after scheduling, based on
    Cydra 5 [RLTS92, DeTo93]
  • Handle both while and do loops
  • Smooth switching to normal scheduling if not
    successful

42
Pro64 Lifetime-Sensitive Modulo Scheduling for
Software Pipelining
  • Features
  • Try to place an op ASAP or ALAP to minimize
    register pressure
  • Slack scheduling
  • Limited backtracking
  • Operation-driven scheduling framework

Scheduling loop: compute Estart/Lstart for all unplaced ops; choose a good op and place it into the current partial schedule within its Estart/Lstart range, ejecting conflicting ops if necessary; repeat until all ops are placed, then register allocate. If allocation succeeds, done; otherwise continue scheduling.
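The Estart/Lstart ranges above can be sketched as a longest-path pass forward and a latest-start pass backward over the dependence graph; the 4-op DAG and latencies below are hypothetical, and ops are assumed to be indexed in topological order:

```c
#include <assert.h>

#define NOPS 4
/* lat[i][j] > 0 means op j depends on op i with that latency */
static const int lat[NOPS][NOPS] = {
    {0, 1, 1, 0},
    {0, 0, 0, 2},
    {0, 0, 0, 1},
    {0, 0, 0, 0},
};

/* Estart: earliest cycle each op can issue; Lstart: latest cycle that
   still fits a schedule of length schedlen. Lstart - Estart = slack. */
static void estart_lstart(int schedlen, int E[NOPS], int L[NOPS]) {
    for (int j = 0; j < NOPS; j++) {          /* forward: longest path in */
        E[j] = 0;
        for (int i = 0; i < j; i++)
            if (lat[i][j] && E[i] + lat[i][j] > E[j])
                E[j] = E[i] + lat[i][j];
    }
    for (int i = NOPS - 1; i >= 0; i--) {     /* backward: latest start out */
        L[i] = schedlen;
        for (int j = i + 1; j < NOPS; j++)
            if (lat[i][j] && L[j] - lat[i][j] < L[i])
                L[i] = L[j] - lat[i][j];
    }
}
```

In modulo scheduling these ranges are recomputed as ops are placed and ejected; ops with zero slack are the critical ones.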
43
Flowchart of Code Generator
(code generator flowchart repeated from slide 25)
44
Integrated Global Local Scheduling (IGLS) Method
  • The basic IGLS framework integrates global code
    motion (GCM) with local scheduling [MaJD98]
  • IGLS extended to hyperblock scheduling
  • Performs profitable code motion between
    hyperblock regions and normal regions

45
IGLS Phase Flow Diagram
Hyperblock Scheduling (HBS): block priority selection, motion selection, target selection
  -> Global Code Motion (GCM) -> Local Code Scheduling (LCS)
46
Advantages of the Extended IGLS Method - The Example Revisited
  • Advantages
  • No rigid boundaries between hyperblocks and
    non-hyperblocks
  • GCM moves code into and out of a hyperblock
    according to profitability

(figure: (a) Pro64 hyperblocks H1 and H2 over the CFG; (b) profitable duplication adding H3)
47
Software Pipelining vs. Normal Scheduling

Is the loop an SWP-amenable candidate?
  • Yes: inner loop processing (software pipelining); on success, go to code emission
  • No, or SWP fails / is not profitable: IGLS -> GRA/LRA -> IGLS -> code emission
48
Flowchart of Code Generator
(code generator flowchart repeated from slide 25)
49
Global and Local Register Allocation (GRA/LRA)
From prepass IGLS:
  • LRA-RQ provides an estimate of local register
    requirements
  • GRA allocates global variables using a priority-based
    register allocator [ChowHennessy90, Chow83, Briggs92]
  • Incorporates IA-64-specific extensions, e.g.
    register stack usage

Flow: GRA -> LRA Register Request (LRA-RQ) -> priority-based register allocation with IA-64 extensions -> LRA -> to postpass IGLS
50
Local Register Allocation (LRA)
  • Assign_Registers uses reverse linear scan
  • Reordering: depth-first ordering on the DDG

Flow: Assign_Registers; if it fails, Fix_LRA: the first time, try instruction reordering; after that, spill (global spill, then local spill) and retry
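A much-simplified sketch of the bottom-up idea behind reverse linear scan (the def/last-use slots below are hypothetical): walking the block from the last instruction upward, a register is claimed at a value's last use and released at its definition, so register pressure falls out of a single pass:

```c
#include <assert.h>

#define NVALS 4
/* def[v]: instruction index that defines value v
   last[v]: instruction index of its last use (== def[v] if never used) */
static int peak_pressure(const int def[NVALS], const int last[NVALS],
                         int nins) {
    int live = 0, peak = 0;
    for (int i = nins - 1; i >= 0; i--) {    /* bottom-up scan */
        for (int v = 0; v < NVALS; v++)
            if (last[v] == i) live++;         /* claim a register at last use */
        if (live > peak) peak = live;
        for (int v = 0; v < NVALS; v++)
            if (def[v] == i) live--;          /* release it at the definition */
    }
    return peak;
}
```

The real allocator also hands out specific register numbers and falls back to Fix_LRA when the peak exceeds what is available; this sketch only measures the peak.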
51
Future Research Topics for Pro64 Code Generator
  • Hyperblock formation
  • Predicate query system
  • Enhanced speculation support

52
PART III Using Pro64 in Compiler Research and
Development
  • Case Studies

53
Outline
  • General Remarks
  • Case Study I: integration of a new instruction
    reordering algorithm to minimize register
    pressure [Govindarajan, Yang, Amaral, Gao 2000]
  • Case Study II: design and evaluation of an
    induction pointer prefetching algorithm
    [Stoutchinin, Douillet, Amaral, Dehnert, Gao 2000]

54
Case I
  • Introduction of the Minimum Register Instruction
    Sequence (MRIS) problem and a proposed solution
  • Problem formulation
  • The proposed algorithm
  • Pro64 porting experience
  • Where to start
  • How to start
  • Results
  • Summary

55
Researchers
  • R. Govindarajan (Indian Institute of Science)
  • Hongbo Yang (Univ. of Delaware)
  • Chihong Zhang (Conexant)
  • José Nelson Amaral (Univ. of Alberta)
  • Guang R. Gao (Univ. of Delaware)

56
The Minimum Register Instruction Sequence Problem
Given a data dependence graph G, derive an
instruction sequence S for G that is optimal in
the sense that its register requirement is
minimum.
57
A Motivating Example
(a) DDG (b) Instruction Sequence 1
(c) Instruction Sequence 2
  • Observation: register requirements drop 25% from
    (b) to (c)!
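The slide's DDG is a figure; the effect can be reproduced with a toy DDG of two independent chains (A->C and B->D, names hypothetical): issuing the chains interleaved keeps two values live at once, while issuing them one after the other needs only one register:

```c
#include <assert.h>

enum { A, B, C, D, NOPS };
/* dep[i][j] != 0: instruction j uses the value produced by i */
static const int dep[NOPS][NOPS] = {
    [A] = { [C] = 1 },
    [B] = { [D] = 1 },
};

/* Max number of values simultaneously live when the DDG is issued
   in the order given by seq (a topological sort of the DDG). */
static int regs_needed(const int seq[NOPS]) {
    int pos[NOPS];
    for (int s = 0; s < NOPS; s++) pos[seq[s]] = s;
    int peak = 0;
    for (int s = 0; s < NOPS; s++) {            /* after slot s executes */
        int live = 0;
        for (int d = 0; d < NOPS; d++) {
            if (pos[d] > s) continue;            /* value not defined yet */
            for (int u = 0; u < NOPS; u++)       /* any use still pending? */
                if (dep[d][u] && pos[u] > s) { live++; break; }
        }
        if (live > peak) peak = live;
    }
    return peak;
}
```

Both orders are legal schedules of the same DDG; only the register requirement differs, which is exactly the MRIS objective.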

58
Motivation
  • IA-64 style processors
  • Reduce spills in local register allocation phase
  • Reduce Local Register Allocation (LRA) requests
    in Global Register Allocation (GRA) phase
  • Reduce overall register pressure on a per
    procedure basis
  • Out-of-order issue processor
  • Instruction reordering buffer
  • Register renaming

59
How to Solve the MRIS Problem?
L1 = (a, b, f, h)   L2 = (c, f)   L3 = (e, g, h)   L4 = (d, g)
  • Register lineages
  • Live ranges of lineages
  • Lineage interference

(a) Concepts  (b) DDG  (c) Lineages
62
How to Solve the MRIS Problem?
L1 = (a, b, f, h)   L2 = (c, f)   L3 = (e, g, h)   L4 = (d, g)
  • Register lineages
  • Live ranges of lineages
  • Lineage interference

(a) Concepts  (b) DDG  (c) Lineages
Questions: Can L1 and L2 share the same register? Can L2 and L3? Can L1 and L4? Can L2 and L4?
63
Lineage Interference Graph
L1 = (a, b, f, h)   L2 = (c, f)   L3 = (e, g, h)   L4 = (d, g)

(a) Original DDG  (b) Lineage Interference Graph (LIG)  (figure: nodes a-h)

Question: Is the lower bound of the required registers 3?
Challenge: derive a Heuristic Register Bound (HRB)!
64
Our Solution Method
  • A good construction algorithm for the LIG
  • An effective heuristic method to calculate the
    HRB
  • An efficient scheduling method (no backtracking)

Flow: DDG -> form Lineage Interference Graph (LIG) -> derive HRB -> extended list scheduling guided by the HRB -> a good instruction sequence
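One way to derive an HRB-style bound is greedy coloring of the lineage interference graph. The 4-lineage interference matrix below is hypothetical (a 4-cycle, for which two registers suffice); it is a sketch of the coloring idea, not the paper's exact heuristic:

```c
#include <assert.h>

#define NL 4
/* interf[i][j] = 1 if lineages i and j have overlapping live ranges */
static const int interf[NL][NL] = {
    {0, 1, 1, 0},
    {1, 0, 0, 1},
    {1, 0, 0, 1},
    {0, 1, 1, 0},
};

/* Greedy coloring: each lineage takes the lowest color unused by its
   already-colored neighbors; the color count is the heuristic bound. */
static int heuristic_register_bound(void) {
    int color[NL], max = 0;
    for (int v = 0; v < NL; v++) {
        unsigned used = 0;                     /* bitmask of neighbor colors */
        for (int u = 0; u < v; u++)
            if (interf[v][u]) used |= 1u << color[u];
        int c = 0;
        while (used & (1u << c)) c++;
        color[v] = c;
        if (c + 1 > max) max = c + 1;
    }
    return max;
}
```

Lineages that share a color can share a physical register, which is what the scheduling phase is then guided by.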
65
Pro64 Porting Experience
  • Porting plan and design
  • Implementation
  • Debugging and validation
  • Evaluation

66
Implementation
  • Dependence graph construction
  • LIG formation
  • LIG construction and coloring
  • The reordering algorithm implementation

67
Porting Plan and Design
../common/targ_info/abi/ia64
  • Understand the compiler infrastructure
  • Understand the register model (mainly from
    targ_info), e.g.:
  • register classes (int, float, predicate, app,
    control)
  • register save/restore conventions: caller/callee
    save, return value, argument passing, stack
    pointer, etc.

68
Register Allocation
GRA
LRA (at block level): Assign_Registers; on failure, Fix_LRA_Blues: reschedule (local code motion) or spill global or local registers
69
Implementation
  • DDG construction: use native service routines,
    e.g. CG_DEP_Compute_Graph
  • LIG coloring: use native support for the set
    package (e.g. bitset.c)
  • Scheduler implementation: native vector package
    support (e.g. cg_vector.cxx)
  • Access the dependence graph using native service
    functions: ARC_succs, ARC_preds, ARC_kind

70
Debugging and Validation
  • Trace file
  • -tt54 0x1: general trace of LRA
  • -tt45 0x4: dependence graph building
  • -tr53: Target Operations (TOPs) before LRA
  • -tr54: TOPs after LRA

71
Evaluation
  • Static measurement
  • fat point: -tt54 0x40
  • Dynamic measurement
  • hardware counters on the R12K, read with perfex

72
Evaluation
  • For the MIPS R12K (SPEC95fp), the lineage-based
    algorithm reduces the number of loads executed by
    12%, the number of stores by 14%, and the
    execution time by 2.5% over a baseline.
  • It is slightly better than the algorithm in the
    MIPSpro compiler.

73
Case II
Design and Evaluation of an Induction Pointer
Prefetching Algorithm
74
Researchers
  • Artour Stoutchinin (STMicroelectronics)
  • José Nelson Amaral (Univ. of Alberta)
  • Guang R. Gao (Univ. of Delaware)
  • Jim Dehnert (Silicon Graphics Inc.)
  • Suneel Jain (Narus Inc.)
  • Alban Douillet (Univ. of Delaware)

75
Motivation
The important loops of many programs are pointer-chasing loops that access recursive data structures through induction pointers.
Example:
    max = 0; current = head;
    while (current != NULL) {
      if (current->key > max)
        max = current->key;
      current = current->next;
    }
76
Problem Statement
How to identify pointer-chasing recurrences?
How to decide whether there are enough processor
resources and memory bandwidth to profitably
prefetch an induction pointer?
How to efficiently integrate induction pointer
prefetching with loop scheduling based on the
profitability analysis?
77
Prefetching Costs
  • More instructions to issue
  • More memory traffic
  • Longer code (disruption in instruction cache)
  • Displacement of potentially good data from cache

Before prefetching:
    t226 = lw 0x34(t228)

After prefetching:
    t226 = lw 0x34(t228)
    tmp = subu t226, t226s
    tmp = addu tmp, tmp
    tmp = addu t226, tmp
    pref 0x0(tmp)
    t226s = t226
78
What to Prefetch? When to Prefetch It?
A good optimizing compiler should only prefetch
data that will actually be referenced.
It should prefetch far enough in advance to
prevent a cache miss when the reference occurs.
But, not too far in advance, because the data
might be evicted from the cache before it is
used, or might displace data that will be
referenced again.
79
Prefetch Address
In order to prefetch, the compiler must calculate
addresses that will be referenced in future
iterations of the loop.
For loops that access regular data structures,
such as vectors and matrices, compilers can use
static analysis of the array indexes to compute
the prefetching addresses.
How can we predict future values of induction
pointers?
80
Key Intuition
Recursive data structures are often allocated
at regular intervals.
Example:
    curr = head = (item *) malloc(sizeof(item));
    while ((curr->key = get_key()) != NULL) {
      curr = curr->next = (item *) malloc(sizeof(item));
      other_memory_allocations();
    }
    curr->next = NULL;
81
Pre-Fetching Technique
Example:
    max = 0; current = head;
    tmp = current;
    while (current != NULL) {
      if (current->key > max)
        max = current->key;
      current = current->next;
      stride = current - tmp;
      prefetch(current + stride * k);
      tmp = current;
    }
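A runnable version of this transformation (the contiguous node pool and the K value are illustrative stand-ins for an allocator that hands out nodes at regular intervals; `__builtin_prefetch` is the GCC/Clang prefetch intrinsic):

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int key; struct node *next; } node;

/* Walk the list, speculatively prefetching K nodes ahead using the
   address stride between the last two nodes; returns the max key. */
static int max_key_with_prefetch(node *head, int K) {
    int max = 0;
    node *current = head, *tmp = current;
    while (current != NULL) {
        if (current->key > max)
            max = current->key;
        current = current->next;
        if (current != NULL) {
            long stride = (char *)current - (char *)tmp;  /* speculated */
            __builtin_prefetch((char *)current + stride * K);
        }
        tmp = current;
    }
    return max;
}
```

A prefetch is only a hint, so a wrong speculated stride costs bandwidth but never correctness, which is what makes this transformation safe.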
82
Prefetch Sequence (R10K)
In our implementation, the stride is recomputed
in every iteration of the loop, making it
tolerant of (infrequent) stride changes.
    stride    = addr - addr.prev
    stride    = stride * k
    addr.pref = addr + stride
    addr.prev = addr
    pref addr.pref
83
Identification of Pointer-Chasing Recurrences
A surprisingly simple method works well look in
the intermediate code for recurrence circuits
containing only loads with constant offsets.
Examples:
    node = ptr->next            r1 <- load r2, offset_next
    ptr  = node->ptr            r2 <- load r1, offset_ptr

    current = current->next     r2 <- load r1
                                r1 <- load r2, offset_next
84
Profitability Analysis
Goal Balance the gains and costs of prefetching.
Although we use resource estimates analogous
to those done for software pipelining, we
consider loop bodies with control flow.
How to estimate the resources available
for prefetching in a basic block B that belongs
to many data dependence recurrences?
85
Software Pipelining
  • What limits the speed of a loop?
  • Data dependences recurrence initiation interval
    (recMII)
  • Processor resources resource initiation
    interval (resMII)
  • Memory accesses memory initiation interval
    (memMII)

(figure: modulo schedule; successive iterations overlap in time)
86
Data Dependences (recMII)
The recurrence minimum initiation interval (recMII) is the maximum, over all recurrence circuits c, of ceil(latency(c) / distance(c)), where distance is the circuit's loop-carried dependence distance.

    for i = 0 to N-1 do
      a: X[i] = X[i-1] + R[i]
      b: Y[i] = X[i] + Z[i-1]
      c: Z[i] = Y[i] + 1
    end

(figure: DDG with edges labeled (dist, lat))
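Numerically, recMII is just that max of ceilings; a sketch with hypothetical circuit values for the loop above (the a->a circuit has one op of latency at distance 1; the b->c->b circuit has two ops of latency at distance 1):

```c
#include <assert.h>

typedef struct { int latency, distance; } Circuit;

/* recMII = max over recurrence circuits c of ceil(latency(c)/distance(c)) */
static int recMII(const Circuit *cs, int n) {
    int mii = 1;
    for (int i = 0; i < n; i++) {
        int v = (cs[i].latency + cs[i].distance - 1) / cs[i].distance;
        if (v > mii) mii = v;
    }
    return mii;
}
```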
87
The recMII for Loops with Control Flow
An instruction of a basic block B can belong to
many recurrences (with distinct control paths). We
define the recurrence MII of a load operation L as
the maximum recMII over all recurrences c with
L ∈ c (L ∈ c means that operation L is part of
recurrence c).
(figure: control flow graph)
88
Processor Resources(resMII)
Processor Resources (resMII)
A basic block B may belong to multiple control
paths. We define the resource constraint of basic
block B as the maximum resMII over all control
paths that execute B.
(figure: control flow graph)
89
Available Memory Bandwidth
Processors with non-blocking caches can support
up to k outstanding cache misses without
stalling. We define the available memory
bandwidth at a basic block B as the minimum, over
all control paths p that execute B, of k - m(p),
where m(p) is the number of expected cache misses
on path p.
(figure: control flow graph)
90
Profitability Analysis
Adding prefetch code for an induction pointer L
in a basic block B is profitable if both: (1) the
MII due to recurrences that contain L is greater
than the resMII after prefetch insertion, and (2)
there is enough memory bandwidth to enable
another cache miss without causing stalls.
91
Computing Available Memory Bandwidth
To compute the available memory bandwidth of a
control path we need to estimate how many cache
misses are expected in that control path.
We use a graph coloring technique over a cache
miss interference graph to predict which memory
references are likely to incur a miss.
92
The Miss Interference Graph
Two memory references interfere if:
1. They are both expected to miss the cache.
2. They can both be issued in the same iteration of the loop.
3. They do not fall into the same cache line.

Miss interference graph assumptions:
1. Loop-invariant references are cache hits
   (global-pointer relative, stack-pointer relative, etc.).
2. Memory references on mutually exclusive control paths
   do not interfere.
3. References relative to the same base address interfere
   only if their relative offset is larger than the cache line.
93
Prefetching Algorithm
DoPrefetch(P, V, E)
 1. C <- pointer-chasing recurrences
 2. R <- prioritized list of induction pointer loads in C
 3. N <- prioritized list of other loads (not in C)
 4. O <- R ∪ N
 5. mark each L in O as a cache miss
 6. for each L in O, L ∈ B
 7.   do if recMII_P(B) >= resMII_P(B) and S(B)
 8.     then add prefetch for L to B
 9.          mark L as cache hit
10.   endif
11. endfor
94
An Example
mcf: minimal cost flow optimizer (Konrad-Zuse Informatics Center, Berlin)

    while (arcin) {
      tail = arcin->tail;
      if (tail->time + arcin->org_cost > latest) {
        arcin = (arc_t *) tail->mark;
        continue;
      }
      arc_cost = tail->potential + head_potential;
      if (red_cost < 0) {
        if (new_arcs < MAX_NEW_ARCS) {
          insert_new_arc(arcnew, new_arcs, tail, head,
                         arc_cost, red_cost);
          new_arcs++;
        } else if ((cost_t) arcnew[0].flow > red_cost) {
          replace_weaker_arc(arcnew, tail, head,
                             arc_cost, red_cost);
        }
      }
      arcin = (arc_t *) tail->mark;
    }
96
B1:  1. t228 = lw 0x0(t226)
     2. t229 = lw 0x14(t226)
     3. t230 = lw 0x38(t228)
     4. t231 = addu t229, t230
     5. t232 = slt t220, 0
     6. bne B3, t232, 0
B2:  7. t226 = lw 0x34(t228)
     8. b B8
B3:  9. t234 = lw 0x2c(t228)
     10. t235 = subu t225, t234
     11. t233 = addiu t235, 0x1e
     12. bgez B7, t233
B4:  insert_new_arc()
B5:  replace_weaker_arc()
B6:  13. t236 = slt t209, t175
     14. beq B6, t236, 0
B7:  15. t226 = lw 0x34(t228)
B8:  16. bne B1, t226, 0
99
B1:  1. t228 = lw 0x0(t226)
     2. t229 = lw 0x14(t226)
     3. t230 = lw 0x38(t228)
     4. t231 = addu t229, t230
     5. t232 = slt t220, 0
     6. bne B3, t232, 0
B2:  7. t226 = lw 0x34(t228)
     8. b B10
B7:  15. t226 = lw 0x34(t228)
(figure: blocks B3-B6 and B8 as in the previous slide)
100
B1:  1.  t228 = lw 0x0(t226)
     1a. tmp  = subu t228, t228s
     1b. tmp  = addu tmp, tmp
     1c. tmp  = addu t228, tmp
     1d. pref 0x34(tmp)
     1e. t228s = t228
     2.  t229 = lw 0x14(t226)
     3.  t230 = lw 0x38(t228)
     4.  t231 = addu t229, t230
     5.  t232 = slt t220, 0
     6.  bne B3, t232, 0
B2:  7.  t226 = lw 0x34(t228)
     7a. tmp  = subu t226, t226s
     7b. tmp  = addu tmp, tmp
     7c. tmp  = addu t226, tmp
     7d. pref 0x0(tmp)
     7e. t226s = t226
     8.  b B10
B7:  15.  t226 = lw 0x34(t228)
     15a. tmp = subu t226, t226s
     15b. tmp = addu tmp, tmp
     15c. tmp = addu t226, tmp
     15d. pref 0x0(tmp)
     15e. t226s = t226
(figure: blocks B3-B6 and B8 as before)
101
When Pointer Prefetch Works
102
When Pointer Prefetch Does Not Help
103
Summary of Attributes
  • Software-only implementation
  • Simple candidate identification
  • Simple code transformation
  • No impact on user data structures
  • Simple profitability analysis, local to loop
  • Performance degradations are rare, minor

104
Open Questions
  • How often is the speculated stride correct?
  • Can instrumentation feedback help?
  • How well does the speculative prefetch work with
    other recursive data structures: trees, graphs,
    etc.?
  • How well does this approach work for read/write
    recursive data structures?

105
Related Work (Software)
  • Luk & Mowry (ASPLOS-96)
  • greedy prefetching, history-pointer prefetching,
    data-linearization prefetching
  • changes the data structure storage
  • Lipasti et al. (Micro-95)
  • prefetching pointers at procedure call sites
  • Liu-Dimitri-Kaeli (Journal of Syst. Arch.-99)
  • maintains a table of offsets for prefetching

106
Related Work (Hardware)
  • Roth-Moshovos-Sohi (ASPLOS, 1998)
  • Gonzales-Gonzales (ICS, 1997)
  • Mehrotra (Urbana-Champaign, 1996)
  • Chen-Baer (Trans. Computer, 1995)
  • Charney-Reeves (Trans. Comp., 1994)
  • Jegou-Teman (ICS, 1993)
  • Fu-Patel (Micro, 1992)

107
Execution Time Measurements
108
Prefetch Improvement
109
L1 Cache Misses
110
L2 Cache Misses
111
TLB Misses
112
Benchmarks
gcc     GNU C compiler
li      Lisp interpreter
mcf     minimal cost flow solver
parser  syntactic parser of English
twolf   place and route simulator
mlp     multi-layer perceptron simulator
ft      minimum spanning tree algorithm