Title: Towards a More Principled Compiler: Progressive Backend Compiler Optimization
1Towards a More Principled Compiler: Progressive Backend Compiler Optimization
2Performance Gains Due to Compiler (gcc)
2.8 GHz Pentium 4, 1 GB RAM, -O3
3The Future of Compiler Optimization
is this possible?
How do we exploit the existing optimization
potential?
Yes!
Need a more principled compiler
10-30% improvement just from reordering compiler phases: http://www.cs.rice.edu/keith/Adapt/
4Compiler code size improvement
5A Principled Compiler
- A compiler that
- knows right from wrong
- (less optimal from more optimal)
- follows a rigorous procedure to get the desired
output
6Today's Compiler
- Problems
- some phases not internally optimal
- purely heuristic solution
- machine description mostly ignored
- lack of integration between phases
target dependent
insn sched
machine description
reg alloc
insn select
branch opt
peephole
optimized program
7Ideal Compiler
- each phase locally optimal
- makes full use of machine description
- tight integration between phases
Absolutely no idea how to do this or if it's even possible
machine description
optimized program
8Towards a More Principled Compiler
- each phase locally optimal
- makes full use of machine description
- tight integration between phases
copy prop
DCE
PRE
loop unroll
const prop
code motion
GVN
inline
strength reduct
CSE
SCCP
peep-hole
reg alloc
branch opt
machine description
optimized program
9Outline
- Motivation
- Related Work
- Completed Work
- Proposed Work
- Contributions Timeline
10Register Allocation Problem
Related Work
unbounded number of program variables
limited number of processor registers slow
memory
spill code optimization
v = 1
w = v + 3
x = w + v
u = v
t = u + x
print(x)
print(w)
print(t)
print(u)
register preferences
rematerialization
register allocator
live range splitting
memory operands
11Register Allocation Previous Work
Related Work
Method | Expressive | Fast | Optimal
Linear Scan
Graph Coloring
Integer Linear Programming
Partitioned Boolean Quadratic Programming / /
12Instruction Selection Problem
Related Work
IR
Assem
instruction selector
IR Representation
movl (p),t1
leal (x,t1),t2
leal 1(y),t3
leal (t2,t3),r
minimum cost tiling
13Instruction Selection Previous Work
Related Work
Method | DAG Tiling | Register Allocation Aware | Fast | Optimal
Dynamic Programming
Binate Covering
Peephole Based Instruction Selection
AVIV Code Generator
Exhaustive Search
14Outline
- Motivation
- Related Work
- Completed Work
- Proposed Work
- Contributions Timeline
15A More Principled Register Allocator
Completed Work
- fully utilize machine description
- explicit and expressive model of costs of allocation for given architecture
- optimal solutions
reg alloc
machine description
16Multi-commodity Network Flow: An Expressive Model
Completed Work
- Given network (directed graph) with
- cost and capacity on each edge
- sources sinks for multiple commodities
- Find lowest cost flow of commodities
- NP-complete for integer flows
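The definition above can be made concrete with a tiny instance. This is a minimal sketch, not the talk's network: the graph, the costs, and the names `EDGES`, `COMMODITIES`, `simple_paths` and `cheapest_integer_flow` are illustrative assumptions. Two commodities each want the same cheap unit-capacity edge, and brute force finds the cheapest integer flow.

```python
from itertools import product

# Directed edges with a cost; every edge has unit capacity (illustrative values).
EDGES = {
    ("sa", "m"): 0, ("sb", "m"): 0,
    ("m", "n"): 0,                      # cheap edge that both commodities want
    ("n", "ta"): 0, ("n", "tb"): 0,
    ("sa", "ta"): 1, ("sb", "tb"): 1,   # costlier private edges
}
COMMODITIES = {"a": ("sa", "ta"), "b": ("sb", "tb")}   # name -> (source, sink)


def simple_paths(src, dst, seen=()):
    """Yield every simple path from src to dst as a tuple of edges."""
    if src == dst:
        yield ()
        return
    for (u, v) in EDGES:
        if u == src and v not in seen:
            for rest in simple_paths(v, dst, seen + (src,)):
                yield ((u, v),) + rest


def path_cost(path):
    return sum(EDGES[e] for e in path)


def cheapest_integer_flow():
    """Brute force: one path per commodity, no edge used more than once."""
    names = list(COMMODITIES)
    options = [list(simple_paths(*COMMODITIES[n])) for n in names]
    best = None
    for choice in product(*options):
        used = [e for path in choice for e in path]
        if len(used) != len(set(used)):          # unit capacity violated
            continue
        cost = sum(path_cost(p) for p in choice)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)))
    return best
```

Each commodity alone can be routed at cost 0, but together one of them has to take its private edge, so the cheapest joint flow costs 1. Enumeration only works at this scale, which is the point of the NP-completeness bullet.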
Example: edges have unit capacity (figure labels: commodities a, b; values 0, 1)
17Register Allocation as a MCNF
Completed Work
Variables -> Commodities
Variable Definition -> Source
Variable Last Use -> Sink
Nodes -> Allocation Classes (Reg/Mem/Const)
Register Limits -> Node Capacities
Spill Costs -> Edge Costs
Allocation -> Flow
a
a
r1
mem
1
3
18Example
Completed Work
Source Code: int example(int a, int b) { int d = 1; int c = a - b; return c + d; }
load cost
insn pref cost
Pre-alloc Assembly:
MOVE 1 -> d
SUB a,b -> c
ADD c,d -> c
MOVE c -> r0
mem access cost
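The first step of the mapping (variable definition -> source, variable last use -> sink) can be sketched for this example. The instruction encoding as `(op, uses, defs)` tuples and the names `commodities`, `EXAMPLE` and `RANGES` are assumptions made for illustration; costs and capacities are left out.

```python
def commodities(params, instrs):
    """Map each variable to (source, sink) = (first definition, last use)."""
    source, sink = {}, {}
    for v in params:
        source[v] = 0                      # parameters are defined at function entry
    for i, (_op, uses, defs) in enumerate(instrs, start=1):
        for v in uses:
            sink[v] = i                    # keep overwriting: ends at the last use
        for v in defs:
            source.setdefault(v, i)        # keep the first definition only
    return {v: (source[v], sink.get(v, source[v])) for v in source}


# The pre-allocation sequence from this slide. The constant 1 and the physical
# register r0 are not program variables, so they are left out.
EXAMPLE = [
    ("MOVE", [], ["d"]),          # MOVE 1 -> d
    ("SUB", ["a", "b"], ["c"]),   # SUB a,b -> c
    ("ADD", ["c", "d"], ["c"]),   # ADD c,d -> c
    ("MOVE", ["c"], []),          # MOVE c -> r0
]

RANGES = commodities(["a", "b"], EXAMPLE)
```

Each entry of `RANGES` is one commodity: its flow enters the network at the first number and leaves at the second.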
19Control Flow
Completed Work
- MCNF can only represent straight-line code
- need to link together networks from basic blocks
Extend MCNF model with merge and split nodes to
implement boundary constraints.
a eax
details in proposal document
along with modeling persistence of values in
memory
a mem
a mem
20A Better Register Allocator
Completed Work
- fully utilize machine description
- explicit and expressive model of costs of allocation for given architecture: Global MCNF
- locally optimal
- NP-hard, so use progressive solution technique
reg alloc
machine description
22Progressive Solution Technique
Completed Work
- Quickly find a good allocation
- Then progressively find better allocations
- until optimal allocation found
- or time limit is reached
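The four bullets describe an anytime search. A minimal sketch of that control loop, assuming a stream of candidate allocations and a known lower bound on cost (the names `progressive_allocate`, `candidates` and `lower_bound` are hypothetical, not from the talk):

```python
import time


def progressive_allocate(initial, candidates, cost, lower_bound, time_limit_s):
    """Return (best, proven_optimal): a best-so-far search that can stop at any time."""
    deadline = time.monotonic() + time_limit_s
    best = initial                         # quickly found, possibly poor allocation
    for cand in candidates:
        if cost(best) <= lower_bound:      # meets the bound: proven optimal
            return best, True
        if time.monotonic() >= deadline:   # out of compile-time budget
            return best, False
        if cost(cand) < cost(best):        # progressively keep better allocations
            best = cand
    return best, cost(best) <= lower_bound
```

For example, with costs standing in for allocations, `progressive_allocate(10, [8, 9, 5, 7], lambda c: c, 5, 60.0)` stops early with a proven-optimal result, while a time limit of `0.0` returns the initial allocation unchanged.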
Allocation Quality
Compile Time
23Lagrangian Relaxation: Intuition
Completed Work
- Relaxes the hard constraints
- only have to solve single commodity flow
- Combines easy subproblems using a Lagrangian multiplier (price)
- an additional price on each edge
- a price on each split/merge node
Example edges have unit capacity
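In symbols, the relaxation of the edge capacity constraints can be written as below. This is the textbook form with assumed notation ($x^k_e$ is the flow of commodity $k$ on edge $e$, $c^k_e$ its cost, $u_e$ the capacity, $N x^k = b^k$ flow conservation, $\lambda_e$ the price); prices on split/merge nodes are not shown.

```latex
\begin{align*}
\text{MCNF:}\quad
  & \min_{x}\ \sum_{k}\sum_{e} c^{k}_{e}\, x^{k}_{e}
  \quad\text{s.t.}\quad N x^{k} = b^{k}\ \ \forall k,\qquad
  \sum_{k} x^{k}_{e} \le u_{e}\ \ \forall e,\qquad x^{k}_{e} \ge 0 \\
\text{Relaxed:}\quad
  & L(\lambda) = \min_{x}\ \sum_{k}\sum_{e} \bigl(c^{k}_{e} + \lambda_{e}\bigr)\, x^{k}_{e}
    \;-\; \sum_{e} \lambda_{e}\, u_{e}
  \quad\text{s.t.}\quad N x^{k} = b^{k}\ \ \forall k,\qquad x^{k}_{e} \ge 0 \\
  & L(\lambda) \le \mathrm{OPT} \quad\text{for every } \lambda \ge 0
\end{align*}
```

With the capacity constraint moved into the objective, the minimization separates into one single-commodity problem per $k$ with edge costs $c^k_e + \lambda_e$.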
24Solution Procedure
Completed Work
- Compute prices with iterative subgradient optimization
- guaranteed to converge to optimal prices
- optimal for linear relaxation
- At each iteration, construct a feasible integer solution using current prices
- iterative allocator in document
- simultaneous allocator
- trace-based simultaneous allocator
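A minimal sketch of the price update on a toy instance, not the allocator's network: every commodity chooses between one shared unit-capacity edge and its own private edge, and the shared edge carries a single price. The instance, the 1/k step size, and the names `priced_choice` and `subgradient_prices` are assumptions.

```python
def priced_choice(shared_cost, private_cost, price):
    """Single-commodity subproblem: use the shared edge iff its priced cost is lower."""
    return "shared" if shared_cost + price < private_cost else "private"


def subgradient_prices(private_costs, shared_cost=0.0, capacity=1, iterations=50):
    """Return a list of (price, lower_bound, choices), one entry per iteration."""
    price = 0.0
    history = []
    for k in range(1, iterations + 1):
        choices = {v: priced_choice(shared_cost, c, price)
                   for v, c in private_costs.items()}
        usage = sum(1 for c in choices.values() if c == "shared")
        # Lagrangian lower bound at the current price.
        bound = sum(min(shared_cost + price, c)
                    for c in private_costs.values()) - price * capacity
        history.append((price, bound, choices))
        # Subgradient = capacity violation; raise the price when over-subscribed.
        price = max(0.0, price + (usage - capacity) / k)
    return history


HISTORY = subgradient_prices({"a": 3, "b": 1})   # private edges cost 3 and 1
```

At price 0 both commodities take the shared edge, which is infeasible. After one update the price is 1, only `a` uses the shared edge, and the bound of 1 equals the cost of that feasible solution. In the real model the priced choices are not always feasible, which is why the slide pairs the prices with an allocator that constructs a feasible solution.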
25Simultaneous Allocator
Completed Work
Edges to/from memory cost 3
Current cost
-1
-3
-2
26Trace-Based Allocation
Completed Work
- Decompose function into traces of basic blocks
- run simultaneous allocator on each trace
- control flow internal to trace presents difficulty
- addressed in proposal document
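The talk does not say how traces are formed. As a sketch under that caveat, a common greedy heuristic grows each trace along the heaviest not-yet-placed successor; `form_traces`, the edge weights and the diamond-shaped CFG below are illustrative assumptions.

```python
def form_traces(entry, successors, weight):
    """Greedy trace formation: grow each trace along the heaviest unvisited successor."""
    visited, traces, worklist = set(), [], [entry]
    while worklist:
        block = worklist.pop(0)
        if block in visited:
            continue
        trace = []
        while block is not None and block not in visited:
            visited.add(block)
            trace.append(block)
            succs = [s for s in successors.get(block, []) if s not in visited]
            succs.sort(key=lambda s: weight.get((block, s), 0), reverse=True)
            worklist.extend(succs[1:])            # colder successors start later traces
            block = succs[0] if succs else None   # hottest successor extends this trace
        traces.append(trace)
    return traces


# Diamond CFG: A branches to B (hot) or C (cold), both rejoin at D.
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
WEIGHTS = {("A", "B"): 9, ("A", "C"): 1}
TRACES = form_traces("A", CFG, WEIGHTS)
```

Here `TRACES` is `[["A", "B", "D"], ["C"]]`: the hot path becomes one trace and the cold block its own.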
27Evaluation
Completed Work
- Implemented in gcc 3.4.4 targeting x86
- Optimize for code size
- perfect static evaluation
- important metric in its own right
- MediaBench, MiBench, Spec95, Spec2000
- over 10,000 functions
28Progressiveness
Completed Work
squareEncrypt
29Progressiveness
Completed Work
quicksort
30Code Size
Completed Work
31Optimality
Completed Work
Proven optimality
32Compile Time Slowdown :-(
Completed Work
9.2x slower
33A Better Register Allocator
Completed Work
- fully utilize machine description
- explicit and expressive model of costs of allocation for given architecture: Global MCNF
- locally optimal
- approach optimality using progressive solution technique: Lagrangian directed allocators
34Outline
- Motivation
- Related Work
- Completed Work
- Proposed Work
- Contributions Timeline
35A Better Better Register Allocator
Proposed Work
- Solver Improvements
- Improve initial solution
- Improve quality as prices converge
- Hope to prove approximation bounds
- Model Improvements
- Improve accuracy of model
- Model simplification
- Represent uniform register sets efficiently
36Model Simplification
Proposed Work
Summarize overly expressive sections of the model
Conservative simplification: does not change optimal value
Aggressive simplification: explore tradeoff between model complexity and optimality
37Instruction Selection Interaction
Proposed Work
- which instruction is best
- depends on the register allocator
- so let register allocator decide
perform same operation
38Register Allocation Aware Instruction SElection
(RA2ISE)
Proposed Work
- Instruction selection not finalized until register allocation
- IR tiled with Register Allocation Aware Tiles (RAATs)
- A RAAT represents several instruction sequences
- different costs
- a sequence for every possible register allocation
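One way to picture a RAAT is as a table from register allocations to instruction sequences and sizes. The sketch below is an assumption about representation, not the proposal's data structure. The `RAAT` class and its fields are hypothetical; the x86 facts used are that `cwtl` (1 byte) sign-extends `ax` into `eax`, while a register-to-register `movswl` takes 3 bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RAAT:
    """Hypothetical register-allocation-aware tile: one IR operation, many encodings."""
    name: str
    special: dict          # (src, dst) -> (assembly, size in bytes)
    default_asm: str       # template used for every other register pair
    default_size: int

    def select(self, src, dst):
        """Pick the instruction sequence once the allocator has fixed src and dst."""
        if (src, dst) in self.special:
            return self.special[(src, dst)]
        return self.default_asm.format(src=src, dst=dst), self.default_size


# Sign-extend a 16-bit value to 32 bits.
SIGN_EXTEND_16_TO_32 = RAAT(
    name="sext16to32",
    special={("ax", "eax"): ("cwtl", 1)},     # short form exists only for eax
    default_asm="movswl %{src}, %{dst}",
    default_size=3,
)
```

`SIGN_EXTEND_16_TO_32.select("ax", "eax")` yields the 1-byte `cwtl`, while `select("cx", "edx")` yields the 3-byte `movswl %cx, %edx`. The instruction is only fixed after the allocator has chosen the registers.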
39RA2ISE
Proposed Work
RAAT
IR
tiling
model creation
register allocation
cwtl eax
40Implementing RA2ISE
Proposed Work
- Add side-constraints to Global MCNF model
- implement inter-variable preferences and constraints
- if x allocated to r1 and y allocated to r2, then save three bytes
- x and y must be allocated to the same register
- Implement x86 RAATs
- RAAT tables created manually
- GMCNF RAAT representation automatically generated from RAAT table with minimum use of side constraints
- Algorithms for tiling RAATs
- leverage existing algorithms
- exploit feedback between passes
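The two kinds of side-constraint in the bullets (a pairwise saving and a same-register requirement) can be illustrated with a brute-force toy. This is a sketch only: it ignores interference and the flow network entirely, and `best_assignment`, `BASE`, `BONUS` and the byte costs are made-up values for illustration.

```python
from itertools import product


def best_assignment(variables, registers, base_cost, pair_bonus, same_register=()):
    """Brute force: minimize per-variable costs minus bonuses of satisfied preferences."""
    best = None
    for regs in product(registers, repeat=len(variables)):
        alloc = dict(zip(variables, regs))
        if any(alloc[u] != alloc[v] for u, v in same_register):   # hard constraint
            continue
        cost = sum(base_cost[(v, r)] for v, r in alloc.items())
        for (u, ru, v, rv), saved in pair_bonus.items():          # soft preference
            if alloc[u] == ru and alloc[v] == rv:
                cost -= saved
        if best is None or cost < best[0]:
            best = (cost, alloc)
    return best


# Illustrative byte costs: y is one byte cheaper in r1 than in r2.
BASE = {("x", "r1"): 4, ("x", "r2"): 4, ("y", "r1"): 4, ("y", "r2"): 5}
# "if x allocated to r1 and y allocated to r2, then save three bytes"
BONUS = {("x", "r1", "y", "r2"): 3}
```

Without the bonus the cheapest choice puts `y` in `r1` (8 bytes). With it, `x` in `r1` and `y` in `r2` wins at 6 bytes. Requiring `same_register=[("x", "y")]` forces both into `r1` at 8 bytes.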
41Tiling RAATs
Proposed Work
2
4
2
4
1
3
feedback
42Evaluation
Proposed Work
- Implement in production quality compiler (gcc)
- Evaluate code size and simple code speed metric
- Evaluate on three different architectures
- x86 (8 registers)
- 68k/ColdFire (16 registers)
- PPC (32 registers)
43Outline
- Motivation
- Related Work
- Completed Work
- Proposed Work
- Contributions Timeline
44Contributions
- RA2ISE
- register allocation aware tiles (RAATs) explicitly encode effect of register allocation on instruction sequence
- algorithms for tiling RAATs
- expressive model of register allocation that operates on RAATs and explicitly represents all important components of register allocation
- progressive solver for this model that can quickly find decent solution and approaches optimality as more time is allowed for compilation
- Comprehensive evaluation of RA2ISE
45Thesis Statement
- RA2ISE is a principled and effective system for
performing instruction selection and register
allocation.
46One Step Towards a More Principled Compiler
copy prop
DCE
PRE
loop unroll
const prop
code motion
GVN
inline
strength reduct
CSE
SCCP
peep-hole
reg alloc
branch opt
machine description
optimized program
47Timeline
Fall 2006: add simple speed metric option to model; begin model simplification work; improve model accuracy and solver performance
Winter 2006: finish model simplification work; add side-constraints to model; implement existing gcc tiles as RAATs; improve model accuracy and solver performance
Spring 2007: finish implementation of side-constraints and gcc RAATs; begin work on RA2ISE infrastructure; create gcc-independent set of RAATs for x86; improve model accuracy and solver performance
Summer 2007: finish work on RA2ISE; investigate and develop tiling algorithms; improve model accuracy and solver performance
Fall 2007: add 68k/ColdFire and PowerPC targets; investigate uniform register set simplifications; improve model accuracy and solver performance
Winter 2007: begin writing thesis; work on improving compile time performance
Spring 2008: finish writing thesis
48Andrew Richard Koes
49Questions?
50Processor Performance
51Instruction Selection and Register Allocation
- fully utilize machine description
- locally optimal
- tight integration between phases
reg alloc
machine description
insn select
52Costs of Register Allocation
- Spilling to/from memory
- movl 8(%ebp), %edx
- Direct memory access
- addl 8(%ebp), %eax
- Moving between registers
- movl %edx, %ecx
- Rematerialization of constant value
- movl $3, %eax
- Register usage preferences
- imul %edx, %eax
- vs.
- imul %edx, %ecx
53Iterative Heuristic Allocator
- Allocate each variable in a heuristic priority order
- Find shortest path in each block
- avoid edges that make remaining problem infeasible
- Process blocks in topological order
- allocation at block entry fixed by previous blocks
Intuition
- shortest path is minimum cost allocation for a variable
- allocate most significant variables first
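A minimal sketch of the idea for one basic block, under strong simplifying assumptions: every variable is live at every program point, and the check for edges that would make the remaining problem infeasible is omitted. The names `shortest_allocation` and `iterative_allocate` and the cost function are hypothetical. The move cost of 3 follows the example slide's edges to/from memory.

```python
INF = float("inf")
MOVE_COST = 3   # edges to/from memory cost 3


def shortest_allocation(n_points, classes, node_cost, blocked):
    """Layered shortest path: cheapest allocation class per program point for one variable."""
    best = {c: (INF if (0, c) in blocked else node_cost(0, c), [c]) for c in classes}
    for p in range(1, n_points):
        nxt = {}
        for c in classes:
            if (p, c) in blocked:              # class already taken at this point
                nxt[c] = (INF, [])
                continue
            prev = min(classes, key=lambda q: best[q][0] + (0 if q == c else MOVE_COST))
            step = 0 if prev == c else MOVE_COST
            nxt[c] = (best[prev][0] + step + node_cost(p, c), best[prev][1] + [c])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


def iterative_allocate(order, n_points, classes, node_cost):
    """Allocate variables one at a time in priority order; registers hold one variable."""
    blocked, result = set(), {}
    for var in order:
        cost, path = shortest_allocation(
            n_points, classes, lambda p, c: node_cost(var, p, c), blocked)
        result[var] = (cost, path)
        blocked |= {(p, c) for p, c in enumerate(path) if c != "mem"}
    return result


# One register, three program points; being in memory costs 1 per point.
RESULT = iterative_allocate(
    ["a", "b"], 3, ["r0", "mem"], lambda v, p, c: 1 if c == "mem" else 0)
```

Variable `a` is allocated first and gets `r0` throughout at cost 0, so `b` is left with memory at cost 3. Reversing the order swaps the outcome, which is why the priority order matters.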
54Iterative Heuristic Allocator
Edges to/from memory cost 3
Allocation order a, b, c, d
Cost
Total 2
55Simultaneous Allocator
- Scan each block
- maintain an allocation of all live variables
- at variable definition find cheapest allocation
- allocation with shortest path to variable's sink or block exit
- allowed to evict (reallocate) already allocated variable
- eviction cost: shortest path to edge from current allocation to new allocation in this block
- cost of eviction added to shortest path cost
Intuition
- minimizing cost for all variables at once
56Trace-Based Allocation
- Decompose function into traces of basic blocks
- run simultaneous allocator on each trace
- control flow internal to trace
- update only blocks that are necessary (easy-update)
- update all affected blocks (full-update)
easy-update
full-update
57Accuracy of the Model
Global MCNF model correctly predicts costs of register allocation within 2% for 72.5% of functions compiled
58Compile Time Slowdown :-(
10x slower
59Code size improvement
60Code Size Improvement
61Code Size Improvement
62Code Performance
63Integrating Register Allocation and Instruction
Selection
int foo(int a, short b) { return a*4 + b; }
Instruction sizes in bytes:
4 movl 4(%esp),%eax
3 sall $2,%eax
4 addl 8(%esp),%eax
1 cwtl
1 ret

5 movswl 8(%esp),%edx
4 movl 4(%esp),%eax
3 leal (%edx,%eax,4),%eax
1 ret
64 Another RAAT