Towards a More Principled Compiler: Progressive Backend Compiler Optimization presentation


1
Towards a More Principled Compiler: Progressive Backend Compiler Optimization

David Koes
8/28/2006

2
Performance Gains Due to Compiler (gcc)
2.8 GHz Pentium 4, 1 GB RAM, -O3
3
The Future of Compiler Optimization
Is this possible?
Yes! 10-30% improvement just from reordering compiler
phases (http://www.cs.rice.edu/~keith/Adapt/)
How do we exploit the existing optimization
potential?
Need a more principled compiler
4
Compiler code size improvement
5
A Principled Compiler

A compiler that
knows right from wrong
(less optimal from more optimal)
follows a rigorous procedure to get the desired
output

6
Today's Compiler

Problems
some phases not internally optimal
purely heuristic solution
machine description mostly ignored
lack of integration between phases

[Diagram: target-dependent phases (insn select, reg alloc, insn sched, branch opt, peephole) between the machine description and the optimized program]
7
Ideal Compiler

each phase locally optimal
makes full use of machine description
tight integration between phases

Absolutely no idea how to do this, or if it's even
possible
[Diagram: machine description → ideal compiler → optimized program]
8
Towards a More Principled Compiler

each phase locally optimal
makes full use of machine description
tight integration between phases

[Diagram: the full optimization pipeline (inline, loop unroll, const prop, copy prop, SCCP, GVN, CSE, PRE, code motion, strength reduct, DCE, branch opt, peep-hole, reg alloc) using the machine description to produce the optimized program]
9
Outline

Motivation
Related Work
Completed Work
Proposed Work
Contributions and Timeline

10
Register Allocation Problem
Related Work
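unbounded number of program variables
limited number of processor registers + slow memory
[Code example: a straight-line fragment that defines v (from 1), w (from v and 3), x (from w and v), u (from v), and t (from u and x), then calls print(x), print(w), print(t), print(u)]
The register allocator must also handle:
spill code optimization
register preferences
rematerialization
live range splitting
memory operands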
11
Register Allocation Previous Work
Related Work
Methods compared on whether they are expressive, fast, and optimal:
Linear Scan
Graph Coloring
Integer Linear Programming
Partitioned Boolean Quadratic Programming
12
Instruction Selection Problem
Related Work
[Diagram: IR → instruction selector → assembly]
Instruction selection finds a minimum cost tiling of the IR representation, producing, e.g.:
movl (p),t1
leal (x,t1),t2
leal 1(y),t3
leal (t2,t3),r
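Dynamic programming is the classic way to find a minimum-cost tiling of an expression tree, and it is the first method listed in the comparison that follows. The sketch below is illustrative only. It assumes a toy IR of nested tuples and a hypothetical tile set with made-up costs, and it handles trees, not DAGs. A node's best cost is the cost of the cheapest tile that matches at that node, plus the best costs of the subtrees the tile leaves uncovered.

```python
from functools import lru_cache

# Toy IR (assumed): ("reg", name) | ("const", n) | ("load", addr) | ("add", lhs, rhs)
# Each matcher returns the subtrees a tile leaves uncovered, or None if it does not match.
def match_reg(n):
    return [] if n[0] == "reg" else None

def match_const(n):
    return [] if n[0] == "const" else None

def match_load(n):
    return [n[1]] if n[0] == "load" else None

def match_add(n):
    return [n[1], n[2]] if n[0] == "add" else None

def match_add_const(n):        # lea-style tile: add(x, const) in one instruction
    if n[0] == "add" and n[2][0] == "const":
        return [n[1]]
    return None

def match_add_mem(n):          # memory-operand tile: add(x, load(a)) in one instruction
    if n[0] == "add" and n[2][0] == "load":
        return [n[1], n[2][1]]
    return None

# Hypothetical tiles: (name, cost, matcher). Costs are illustrative.
TILES = [
    ("reg", 0, match_reg),
    ("mov-imm", 1, match_const),
    ("load", 1, match_load),
    ("add", 1, match_add),
    ("add-imm", 1, match_add_const),
    ("add-mem", 1, match_add_mem),
]

@lru_cache(maxsize=None)
def best_tiling(node):
    """Return (cost, tiles) for the cheapest tiling of the subtree rooted at node."""
    best = None
    for name, cost, matcher in TILES:
        rest = matcher(node)
        if rest is None:
            continue
        total, used = cost, [name]
        for sub in rest:               # add the optimal cost of each uncovered subtree
            sub_cost, sub_tiles = best_tiling(sub)
            total += sub_cost
            used += sub_tiles
        if best is None or total < best[0]:
            best = (total, tuple(used))
    return best
```

With these costs, folding the load into the add as a memory operand costs less than a separate load followed by an add.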
13
Instruction Selection Previous Work
Related Work
Methods compared on DAG tiling, register allocation awareness, speed, and optimality:
Dynamic Programming
Binate Covering
Peephole Based Instruction Selection
AVIV Code Generator
Exhaustive Search
14
Outline

Motivation
Related Work
Completed Work
Proposed Work
Contributions and Timeline

15
A More Principled Register Allocator
Completed Work

fully utilize machine description
explicit and expressive model of costs of
allocation for given architecture
optimal solutions

[Diagram: reg alloc driven by the machine description]
16
Multi-commodity Network Flow An Expressive Model
Completed Work

Given network (directed graph) with
cost and capacity on each edge
sources and sinks for multiple commodities
Find lowest cost flow of commodities
NP-complete for integer flows

[Figure: example network with commodities a and b; edges have unit capacity]
17
Register Allocation as a MCNF
Completed Work
Variables → Commodities
Variable Definition → Source
Variable Last Use → Sink
Nodes → Allocation Classes (Reg/Mem/Const)
Register Limits → Node Capacities
Spill Costs → Edge Costs
Allocation → Flow
[Figure: network for variable a, with register (r1) and memory (mem) nodes and example edge costs 1 and 3]
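The sketch below is a rough illustration of this mapping. It assumes a single straight-line block, a hypothetical two-register file plus memory, and illustrative costs: 3 for a load or store, 1 for a register move. Each variable's allocation is a path with one allocation class per program point. Its cost is the sum of the edge costs along that path, and each register node has capacity one. Definition and use costs are left out for brevity.

```python
from collections import Counter

REGISTERS = {"r0", "r1"}          # hypothetical register file; "mem" is unbounded

def edge_cost(src, dst):
    """Illustrative cost of keeping or moving a value between adjacent program points."""
    if src == dst:
        return 0
    if "mem" in (src, dst):
        return 3                  # load or store
    return 1                      # register-to-register move

def allocation_cost(paths):
    """Total cost of an allocation given as one path (flow) per variable.

    paths maps variable -> (first program point, [class at each live point]).
    Raises ValueError if a register node (register, point) carries more than
    one variable, i.e. if a node capacity is exceeded."""
    usage = Counter()
    total = 0
    for start, classes in paths.values():
        for offset, cls in enumerate(classes):
            if cls in REGISTERS:
                usage[(cls, start + offset)] += 1
        total += sum(edge_cost(a, b) for a, b in zip(classes, classes[1:]))
    overloaded = sorted(node for node, count in usage.items() if count > 1)
    if overloaded:
        raise ValueError(f"node capacity exceeded at {overloaded}")
    return total
```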
18
Example
Completed Work
Source Code:
int example(int a, int b) {
    int d = 1;
    int c = a - b;
    return c + d;
}
Pre-alloc Assembly:
MOVE 1 -> d
SUB a,b -> c
ADD c,d -> c
MOVE c -> r0
[Figure: the corresponding network, annotated with load costs, instruction preference costs, and memory access costs]
19
Control Flow
Completed Work

MCNF can only represent straight-line code
need to link together networks from basic blocks

Extend MCNF model with merge and split nodes to
implement boundary constraints.
[Figure: split/merge nodes joining the networks of adjacent blocks for variable a, with allocations eax and mem]
details, along with modeling the persistence of values in
memory, are in the proposal document
20
A Better Register Allocator
Completed Work

fully utilize machine description
explicit and expressive model of costs of
allocation for given architecture: Global MCNF
locally optimal
NP-hard, so use progressive solution technique

[Diagram: reg alloc driven by the machine description]
22
Progressive Solution Technique
Completed Work

Quickly find a good allocation
Then progressively find better allocations
until optimal allocation found
or time limit is reached

[Plot: allocation quality vs. compile time]
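The progressive loop can be sketched generically. This is an outline built on assumptions, not the allocator itself. The hypothetical `step` callback is assumed to return three things: a candidate allocation's cost, the allocation, and a proven lower bound on the optimal cost. A relaxation such as the Lagrangian one on the next slide can supply such bounds. The search stops in one of two cases: the best cost meets the bound, which proves optimality, or the time limit expires.

```python
import time

def progressive_solve(step, time_limit):
    """Anytime search that keeps the best allocation found so far.

    step(i) is a hypothetical callback returning (cost, solution, lower_bound)
    for iteration i; step(0) should be a fast initial solution.
    Returns (best_cost, best_solution, proven_optimal)."""
    deadline = time.monotonic() + time_limit
    best_cost, best, bound = step(0)            # quick, decent first answer
    iteration = 0
    # Keep improving while a gap remains between the best cost and the bound.
    while best_cost > bound and time.monotonic() < deadline:
        iteration += 1
        cost, solution, new_bound = step(iteration)
        bound = max(bound, new_bound)           # keep the tightest bound seen
        if cost < best_cost:
            best_cost, best = cost, solution
    return best_cost, best, best_cost <= bound
```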
23
Lagrangian Relaxation Intuition
Completed Work

Relaxes the hard constraints
only have to solve single commodity flow
Combines easy subproblems using a Lagrangian
multiplier (price)
an additional price on each edge
a price on each split/merge node

[Figure: example network; edges have unit capacity]
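Here is the relaxation in standard Lagrangian notation. The symbols are assumptions, not taken from the proposal: x_e^k is the flow of commodity k on edge e, c_e^k is its cost, u_e is the capacity of e, and w_e ≥ 0 is the price on e.

```latex
\begin{aligned}
\text{MCNF:}\quad & \min_{x} \sum_{k}\sum_{e} c_e^{k} x_e^{k}
  \quad \text{s.t. } \sum_{k} x_e^{k} \le u_e \ \ \forall e,\ \ x^{k} \text{ a feasible flow of commodity } k \\
\text{Relaxed:}\quad & L(w) = \min_{x} \sum_{k}\sum_{e} \bigl(c_e^{k} + w_e\bigr) x_e^{k} - \sum_{e} w_e u_e,
  \qquad w_e \ge 0
\end{aligned}
```

For fixed prices the minimization splits into independent single-commodity problems with edge costs c_e^k + w_e. For any w ≥ 0, L(w) is a lower bound on the MCNF optimum. The split/merge node prices mentioned above are omitted here.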
24
Solution Procedure
Completed Work

Compute prices with iterative subgradient
optimization
guaranteed to converge to optimal prices
optimal for the linear relaxation
At each iteration, construct a feasible integer
solution using current prices
iterative allocator (in proposal document)
simultaneous allocator
trace-based simultaneous allocator
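One iteration of the price computation can be sketched as follows. The sketch assumes each commodity's candidate paths are listed explicitly; a real allocator would run a shortest-path search instead. It also uses a fixed step size for simplicity, whereas subgradient methods generally need a step-size schedule to converge. The function name and data layout are hypothetical.

```python
def lagrangian_iteration(paths, cost, capacity, prices, step_size):
    """Solve the relaxed problem at fixed prices, then update the prices.

    paths:    commodity -> list of candidate paths (tuples of edge names)
    cost:     edge -> base cost;  capacity: edge -> capacity
    prices:   edge -> current Lagrangian multiplier (>= 0)
    Returns (lower_bound, usage, new_prices)."""
    usage = {e: 0 for e in capacity}
    relaxed = 0.0
    for options in paths.values():
        # Each commodity independently takes its cheapest path under c_e + w_e.
        best = min(options, key=lambda p: sum(cost[e] + prices[e] for e in p))
        relaxed += sum(cost[e] + prices[e] for e in best)
        for e in best:
            usage[e] += 1
    lower_bound = relaxed - sum(prices[e] * capacity[e] for e in capacity)
    # Projected subgradient step: raise prices on overused edges, never below 0.
    new_prices = {
        e: max(0.0, prices[e] + step_size * (usage[e] - capacity[e]))
        for e in capacity
    }
    return lower_bound, usage, new_prices
```

Consider two commodities that can each use an edge `cheap` (cost 1) or an edge `dear` (cost 3), with every capacity equal to 1. Three iterations at step size 1 give bounds of 2, 3, and 4, and 4 is the cost of the best feasible routing. The relaxed routing still sends both commodities over `cheap`. This is why a feasible integer allocation is constructed separately at each iteration.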

25
Simultaneous Allocator
Completed Work
[Figure: simultaneous allocator example; edges to/from memory cost 3; current cost shown, with values -1, -3, -2]
26
Trace-Based Allocation
Completed Work

Decompose function into traces of basic blocks
run simultaneous allocator on each trace
control flow internal to a trace presents a difficulty,
addressed in the proposal document

27
Evaluation
Completed Work

Implemented in gcc 3.4.4 targeting x86
Optimize for code size
perfect static evaluation
important metric in its own right
MediaBench, MiBench, SPEC95, SPEC2000
over 10,000 functions

28
Progressiveness
Completed Work
squareEncrypt
29
Progressiveness
Completed Work
quicksort
30
Code Size
Completed Work
31
Optimality
Completed Work
Proven optimality
32
Compile Time Slowdown :-(
Completed Work
9.2x slower
33
A Better Register Allocator
Completed Work

fully utilize machine description
explicit and expressive model of costs of
allocation for given architecture: Global MCNF
locally optimal
approach optimality using progressive solution
technique: Lagrangian-directed allocators

34
Outline

Motivation
Related Work
Completed Work
Proposed Work
Contributions and Timeline

35
A Better Better Register Allocator
Proposed Work

Solver Improvements
Improve initial solution
Improve quality as prices converge
Hope to prove approximation bounds
Model Improvements
Improve accuracy of model
Model simplification
Represent uniform register sets efficiently

36
Model Simplification
Proposed Work
Summarize overly expressive sections of the model
Conservative simplification: does not change the optimal value
Aggressive simplification: explores the tradeoff between model complexity and optimality
37
Instruction Selection Interaction
Proposed Work

which instruction is best
depends on the register allocator
so let the register allocator decide

[Figure: alternative instructions that perform the same operation]
38
Register Allocation Aware Instruction Selection
(RA2ISE)
Proposed Work

Instruction selection not finalized until
register allocation
IR tiled with Register Allocation Aware Tiles
(RAATs)
A RAAT represents several instruction sequences
with different costs
a sequence for every possible register allocation
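A RAAT can be pictured as a table indexed by allocation. The sketch below uses a hypothetical table for `d = a + b` that has only three rows. The sizes are typical x86 encoding lengths for these operand kinds and serve only as illustrative costs. Given the allocation chosen by the register allocator, the tile expands to its cheapest applicable sequence.

```python
def is_reg(loc):
    """Locations starting with % are registers; anything else is memory."""
    return loc.startswith("%")

# Hypothetical RAAT for the IR operation  d = a + b  (AT&T syntax).
# Each row: (does it apply to this allocation?, assembly template, size in bytes).
RAAT_ADD = [
    (lambda d, a, b: d == a and is_reg(a) and is_reg(b), "addl {b},{d}", 2),
    (lambda d, a, b: d == a and is_reg(a) and not is_reg(b), "addl {b},{d}", 3),
    (lambda d, a, b: is_reg(d) and is_reg(a) and is_reg(b), "leal ({a},{b}),{d}", 3),
]

def expand(raat, d, a, b):
    """Return (size, assembly) of the cheapest row that fits the allocation."""
    options = [(size, template.format(d=d, a=a, b=b))
               for applies, template, size in raat if applies(d, a, b)]
    if not options:
        raise ValueError("this RAAT has no sequence for the given allocation")
    return min(options)
```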

39
RA2ISE
Proposed Work
[Diagram: IR → tiling with RAATs → model creation → register allocation]
40
Implementing RA2ISE
Proposed Work

Add side-constraints to Global MCNF model
implement inter-variable preferences and
constraints
e.g., if x is allocated to r1 and y to r2, then
save three bytes
e.g., x and y must be allocated to the same register
Implement x86 RAATs
RAAT tables created manually
GMCNF RAAT representation automatically generated
from RAAT table with minimum use of side
constraints
Algorithms for tiling RAATs
leverage existing algorithms
exploit feedback between passes
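One way to read these side constraints is as adjustments applied on top of the flow cost. The sketch below assumes a hypothetical representation of joint preferences and same-register constraints. It does not reproduce the proposal's GMCNF encoding.

```python
def adjusted_cost(base_cost, allocation, preferences, same_register):
    """Apply hypothetical inter-variable side constraints to a flow cost.

    allocation:    variable -> register name, or "mem"
    preferences:   list of ((var1, reg1), (var2, reg2), saving) entries
    same_register: list of (var1, var2) pairs that must share one register
    Returns the adjusted cost, or None if a hard constraint is violated."""
    for v1, v2 in same_register:
        if allocation[v1] == "mem" or allocation[v1] != allocation[v2]:
            return None                      # hard constraint broken
    cost = base_cost
    for (v1, r1), (v2, r2), saving in preferences:
        if allocation.get(v1) == r1 and allocation.get(v2) == r2:
            cost -= saving                   # joint preference satisfied
    return cost
```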

41
Tiling RAATs
Proposed Work
[Figure: example tiling with RAATs and their costs, with feedback between passes]
42
Evaluation
Proposed Work

Implement in a production-quality compiler (gcc)
Evaluate code size and simple code speed metric
Evaluate on three different architectures
x86 (8 registers)
68k/ColdFire (16 registers)
PPC (32 registers)

43
Outline

Motivation
Related Work
Completed Work
Proposed Work
Contributions and Timeline

44
Contributions

RA2ISE
register allocation aware tiles (RAATs)
explicitly encode effect of register allocation
on instruction sequence
algorithms for tiling RAATs
expressive model of register allocation that
operates on RAATs and explicitly represents all
important components of register allocation
progressive solver for this model that can
quickly find a decent solution and approach
optimality as more time is allowed for
compilation
Comprehensive evaluation of RA2ISE

45
Thesis Statement

RA2ISE is a principled and effective system for
performing instruction selection and register
allocation.

46
One Step Towards a More Principled Compiler
[Diagram: the full optimization pipeline (inline, loop unroll, const prop, copy prop, SCCP, GVN, CSE, PRE, code motion, strength reduct, DCE, branch opt, peep-hole, reg alloc) using the machine description to produce the optimized program]
47
Timeline
Fall 2006: add simple speed metric option to model; begin model simplification work; improve model accuracy and solver performance
Winter 2006: finish model simplification work; add side-constraints to model; implement existing gcc tiles as RAATs; improve model accuracy and solver performance
Spring 2007: finish implementation of side-constraints and gcc RAATs; begin work on RA2ISE infrastructure; create gcc-independent set of RAATs for x86; improve model accuracy and solver performance
Summer 2007: finish work on RA2ISE; investigate and develop tiling algorithms; improve model accuracy and solver performance
Fall 2007: add 68k/ColdFire and PowerPC targets; investigate uniform register set simplifications; improve model accuracy and solver performance
Winter 2007: begin writing thesis; work on improving compile time performance
Spring 2008: finish writing thesis
48
Andrew Richard Koes
49
Questions?
50
Processor Performance
51
Instruction Selection + Register Allocation

fully utilize machine description
locally optimal
tight integration between phases

[Diagram: insn select and reg alloc, both driven by the machine description]
52
Costs of Register Allocation

Spilling to/from memory
movl 8(%ebp), %edx
Direct memory access
addl 8(%ebp), %eax
Moving between registers
movl %edx, %ecx
Rematerialization of constant value
movl $3, %eax
Register usage preferences
imul %edx, %eax
vs.
imul %edx, %ecx

53
Iterative Heuristic Allocator

Allocate each variable in a heuristic priority
order
Find shortest path in each block
avoid edges that make remaining problem
infeasible
Process blocks in topological order
allocation at block entry fixed by previous blocks

Intuition

shortest path is minimum cost allocation for a
variable
allocate most significant variables first
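The sketch below covers a single straight-line block. It assumes a few things not given on the slide:

- live ranges are given as program-point intervals in priority order;
- the machine has two registers plus memory;
- costs are illustrative: 3 for a load or store, 1 for a register move;
- a hypothetical `def_cost` prices each class at the definition.

Memory has unbounded capacity here, so no choice can make the remaining problem infeasible, and the sketch omits that check. The real allocator must also respect block-entry allocations fixed by earlier blocks.

```python
REGISTERS = ["r0", "r1"]          # hypothetical register file
CLASSES = REGISTERS + ["mem"]     # memory has unbounded capacity

def move_cost(src, dst):
    """Illustrative cost between adjacent program points: loads/stores 3, moves 1."""
    if src == dst:
        return 0
    return 3 if "mem" in (src, dst) else 1

def shortest_path(start, end, taken, def_cost):
    """Cheapest sequence of classes for a value live at points start..end.

    taken holds (register, point) nodes already used by earlier variables;
    def_cost(cls) is the cost of defining the value in class cls."""
    def usable(cls, point):
        return cls == "mem" or (cls, point) not in taken

    # best[cls] = (cost, path) of the cheapest path ending in cls at this point
    best = {c: (def_cost(c), [c]) for c in CLASSES if usable(c, start)}
    for point in range(start + 1, end + 1):
        nxt = {}
        for c in CLASSES:
            if usable(c, point):
                cost, path = min((p_cost + move_cost(prev, c), p_path)
                                 for prev, (p_cost, p_path) in best.items())
                nxt[c] = (cost, path + [c])
        best = nxt
    return min(best.values())

def iterative_allocate(variables, def_cost):
    """Allocate (name, start, end) live ranges one at a time, in priority order."""
    taken, allocation, total = set(), {}, 0
    for name, start, end in variables:
        cost, path = shortest_path(start, end, taken, def_cost)
        for offset, cls in enumerate(path):
            if cls != "mem":
                taken.add((cls, start + offset))   # register node now full
        allocation[name] = path
        total += cost
    return total, allocation
```

Take three variables live at the same three points, where defining a value in memory costs 3. The first two variables get r0 and r1, and the third is left in memory.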

54
Iterative Heuristic Allocator
Edges to/from memory cost 3
Allocation order: a, b, c, d
[Figure: per-variable allocation costs; total cost 2]
55
Simultaneous Allocator

Scan each block
maintain an allocation of all live variables
at a variable's definition, find the cheapest allocation:
the allocation with the shortest path to the variable's sink
or block exit
allowed to evict (reallocate) an already allocated
variable
eviction cost: shortest path to an edge from the current
allocation to the new allocation in this block
cost of eviction added to shortest path cost

Intuition

minimizing cost for all variables at once

56
Trace-Based Allocation

Decompose function into traces of basic blocks
run simultaneous allocator on each trace
control flow internal to a trace is handled by either
updating only the blocks that are necessary
(easy-update), or
updating all affected blocks (full-update)

[Figure: easy-update vs. full-update]
57
Accuracy of the Model
Global MCNF model correctly predicts costs of
register allocation within 2 for 72.5% of
functions compiled
58
Compile Time Slowdown :-(
10x slower
59
Code size improvement
60
Code Size Improvement
61
Code Size Improvement
62
Code Performance
63
Integrating Register Allocation and Instruction
Selection
int foo(int a, short b) { return a*4 + b; }
(numbers are instruction sizes in bytes)
4 movl 4(%esp),%eax
3 sall $2,%eax
4 addl 8(%esp),%eax
1 cwtl
1 ret
vs.
5 movswl 8(%esp),%edx
4 movl 4(%esp),%eax
3 leal (%edx,%eax,4),%eax
1 ret
64
Another RAAT
