Towards a More Principled Compiler: Progressive Backend Compiler Optimization


Title: Towards a More Principled Compiler: Progressive Backend Compiler Optimization


1
Towards a More Principled Compiler: Progressive
Backend Compiler Optimization
  • David Koes
  • 8/28/2006

2
Performance Gains Due to Compiler (gcc)
2.8 GHz Pentium 4, 1 GB RAM, -O3
3
The Future of Compiler Optimization
Is this possible?
How do we exploit the existing optimization
potential?
Yes!
Need a more principled compiler
10-30% improvement just from reordering compiler
phases: http://www.cs.rice.edu/~keith/Adapt/
4
Compiler code size improvement
5
A Principled Compiler
  • A compiler that
  • knows right from wrong
  • (less optimal from more optimal)
  • follows a rigorous procedure to get the desired
    output

6
Today's Compiler
  • Problems
  • some phases not internally optimal
  • purely heuristic solution
  • machine description mostly ignored
  • lack of integration between phases

target dependent
insn sched
machine description
reg alloc
insn select
branch opt
peephole

optimized program
7
Ideal Compiler
  • each phase locally optimal
  • makes full use of machine description
  • tight integration between phases

Absolutely no idea how to do this, or if it's even
possible
machine description
optimized program
8
Towards a More Principled Compiler
  • each phase locally optimal
  • makes full use of machine description
  • tight integration between phases

copy prop
DCE
PRE
loop unroll
const prop
code motion
GVN
inline
strength reduct
CSE
SCCP
peep-hole
reg alloc
branch opt
machine description

optimized program
9
Outline
  1. Motivation
  2. Related Work
  3. Completed Work
  4. Proposed Work
  5. Contributions and Timeline

10
Register Allocation Problem
Related Work
unbounded number of program variables
limited number of processor registers and slow
memory
spill code optimization
v 1 w v 3 x w v u v t u
x print(x) print(w) print(t) print(u)
register preferences
rematerialization
register allocator
live range splitting
memory operands
11
Register Allocation Previous Work
Related Work
Method Expressive Fast Optimal
Linear Scan
Graph Coloring
Integer Linear Programming
Partitioned Boolean Quadratic Programming / /
12
Instruction Selection Problem
Related Work
IR
Assem
instruction selector
IR Representation
movl (p),t1
leal (x,t1),t2
leal 1(y),t3
leal (t2,t3),r
minimum cost tiling
13
Instruction Selection Previous Work
Related Work
Method DAG Tiling Register Allocation Aware Fast Optimal
Dynamic Programming
Binate Covering
Peephole Based Instruction Selection
AVIV Code Generator
Exhaustive Search
14
Outline
  1. Motivation
  2. Related Work
  3. Completed Work
  4. Proposed Work
  5. Contributions and Timeline

15
A More Principled Register Allocator
Completed Work
  • fully utilize machine description
  • explicit and expressive model of costs of
    allocation for given architecture
  • optimal solutions

reg alloc
machine description
16
Multi-commodity Network Flow: An Expressive Model
Completed Work
  • Given network (directed graph) with
  • cost and capacity on each edge
  • sources and sinks for multiple commodities
  • Find lowest cost flow of commodities
  • NP-complete for integer flows
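
The problem statement above can be sketched in a few lines of Python. This is a minimal illustration, assuming a tiny hypothetical network: the node names, costs, and the helpers `simple_paths` and `min_cost_mcnf` are invented for the example and do not come from the talk. It enumerates one path per commodity and keeps the cheapest combination that respects edge capacities. Exhaustive search is only viable at this scale, which fits the NP-completeness of the integer problem.

```python
from itertools import product

# Hypothetical network: edge -> (cost, capacity). Every edge has unit capacity.
EDGES = {
    ("s", "r1"): (0, 1),
    ("s", "mem"): (3, 1),
    ("r1", "t"): (0, 1),
    ("mem", "t"): (3, 1),
}

def simple_paths(edges, src, dst, seen=()):
    """Yield every simple path from src to dst as a tuple of edges."""
    if src == dst:
        yield ()
        return
    visited = seen + (src,)
    for (u, v) in edges:
        if u == src and v not in visited:
            for rest in simple_paths(edges, v, dst, visited):
                yield ((u, v),) + rest

def min_cost_mcnf(edges, commodities):
    """Return (cost, {commodity: path}) for the cheapest integer flow, or None."""
    names = list(commodities)
    choices = [list(simple_paths(edges, *commodities[n])) for n in names]
    best = None
    for combo in product(*choices):
        # Count how many commodities use each edge and reject overloads.
        load = {}
        for path in combo:
            for e in path:
                load[e] = load.get(e, 0) + 1
        if any(load[e] > edges[e][1] for e in load):
            continue
        cost = sum(edges[e][0] for path in combo for e in path)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, combo)))
    return best

# Two commodities, each sending one unit from s to t.
result = min_cost_mcnf(EDGES, {"a": ("s", "t"), "b": ("s", "t")})
print(result)
```

Because the cheap route through `r1` can carry only one commodity, the other is forced through `mem`, which is the same tension a register allocator faces when variables outnumber registers.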

b
a
Example: edges have unit capacity
0
1
b
a
17
Register Allocation as a MCNF
Completed Work
Variables -> Commodities
Variable Definition -> Source
Variable Last Use -> Sink
Nodes -> Allocation Classes (Reg/Mem/Const)
Register Limits -> Node Capacities
Spill Costs -> Edge Costs
Allocation -> Flow
a
a
r1
mem
1
3
18
Example
Completed Work
Source Code:
int example(int a, int b) {
  int d = 1;
  int c = a - b;
  return c + d;
}
load cost
insn pref cost
Pre-alloc Assembly:
MOVE 1 -> d
SUB a,b -> c
ADD c,d -> c
MOVE c -> r0
mem access cost
19
Control Flow
Completed Work
  • MCNF can only represent straight-line code
  • need to link together networks from basic blocks

Extend MCNF model with merge and split nodes to
implement boundary constraints.
a eax
details in proposal document
along with modeling persistence of values in
memory
a mem
a mem
20
A Better Register Allocator
Completed Work
  • fully utilize machine description
  • explicit and expressive model of costs of
    allocation for given architecture: Global MCNF
  • locally optimal
  • NP-hard, so use progressive solution technique

reg alloc
machine description
22
Progressive Solution Technique
Completed Work
  • Quickly find a good allocation
  • Then progressively find better allocations
  • until optimal allocation found
  • or time limit is reached
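
The loop described above can be sketched as an anytime search. This is a minimal sketch, not the talk's implementation: `progressive_allocate`, its parameters, and the toy `toy_improve` step are hypothetical names chosen for illustration, and the lower bound is assumed to come from some separate proof of optimality.

```python
import time

def progressive_allocate(initial, improve, lower_bound, time_limit_s):
    """Anytime search: always hold the best allocation found so far.

    initial      -- (cost, allocation) produced quickly by a heuristic
    improve      -- callable taking the best-so-far, returning a candidate or None
    lower_bound  -- a proven lower bound on the optimal cost
    time_limit_s -- wall-clock budget in seconds
    """
    best = initial
    deadline = time.monotonic() + time_limit_s
    # Stop when the bound proves optimality or the time budget runs out.
    while best[0] > lower_bound and time.monotonic() < deadline:
        candidate = improve(best)
        if candidate is None:            # the improver has nothing left to try
            break
        if candidate[0] < best[0]:
            best = candidate
    return best

def toy_improve(best):
    """Stand-in improvement step: shave one unit of cost per call."""
    cost, steps = best
    return (cost - 1, steps + ["step"])

best = progressive_allocate((10, []), toy_improve, lower_bound=7, time_limit_s=1.0)
unchanged = progressive_allocate((10, []), toy_improve, lower_bound=7, time_limit_s=0.0)
print(best, unchanged)
```

The caller always gets a usable allocation back, and extra compile time only ever improves it.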

Allocation Quality
Compile Time
23
Lagrangian Relaxation Intuition
Completed Work
  • Relaxes the hard constraints
  • only have to solve single commodity flow
  • Combines easy subproblems using a Lagrangian
    multiplier (price)
  • an additional price on each edge
  • a price on each split/merge node
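
In standard textbook notation, the relaxation can be written as follows. The symbols are assumptions for this sketch rather than the slides' own: $x^{k}_{ij}$ is the flow of commodity $k$ on edge $(i,j)$, $c^{k}_{ij}$ its cost, $u_{ij}$ the shared capacity, $N x^{k} = b^{k}$ flow conservation, and $w_{ij} \ge 0$ the price. Only the edge prices are shown; the prices on split/merge nodes are omitted.

```latex
\begin{aligned}
\text{MCNF:}\quad
  & \min_{x}\ \sum_{k}\sum_{(i,j)} c^{k}_{ij}\,x^{k}_{ij}
  \quad\text{s.t.}\quad
  \sum_{k} x^{k}_{ij} \le u_{ij}\ \ \forall (i,j),\qquad
  N x^{k} = b^{k}\ \ \forall k,\qquad x^{k}_{ij} \ge 0 \\[4pt]
\text{Relaxed:}\quad
  & L(w) = \min_{x}\ \sum_{k}\sum_{(i,j)} \bigl(c^{k}_{ij} + w_{ij}\bigr)\,x^{k}_{ij}
  \;-\; \sum_{(i,j)} w_{ij}\,u_{ij}
  \quad\text{s.t.}\quad
  N x^{k} = b^{k}\ \ \forall k,\qquad x^{k}_{ij} \ge 0
\end{aligned}
```

For fixed prices $w$ the relaxed problem has no constraint linking the commodities, so it splits into one single-commodity flow problem per $k$, and $L(w)$ is a lower bound on the optimal MCNF cost for every $w \ge 0$.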

Example: edges have unit capacity
24
Solution Procedure
Completed Work
  • Compute prices with iterative subgradient
    optimization
  • guaranteed to converge to optimal prices
  • optimal for linear relaxation
  • At each iteration, construct a feasible integer
    solution using current prices
  • iterative allocator in document
  • simultaneous allocator
  • trace-based simultaneous allocator
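
The procedure above can be sketched on a toy instance. This is a minimal sketch under stated assumptions: two variables compete for a single register, memory costs 3, the step size is the diminishing sequence 1/t, and the feasible solution is built by a simple greedy pass. The instance and all function names are hypothetical, and the greedy pass stands in for the allocators named on the slide rather than reproducing them.

```python
# Hypothetical instance: one register (capacity 1), memory (capacity 2, cost 3).
COST = {"reg": 0.0, "mem": 3.0}
CAP = {"reg": 1, "mem": 2}
COMMODITIES = ["a", "b"]

def priced(edge, price):
    """Edge cost plus its current Lagrangian price."""
    return COST[edge] + price[edge]

def relaxed_choice(price):
    """Each commodity independently takes its cheapest edge under current prices."""
    return {k: min(COST, key=lambda e: priced(e, price)) for k in COMMODITIES}

def lagrangian_value(choice, price):
    """L(w): priced cost of the relaxed solution minus the sum of w_e * capacity_e."""
    return (sum(priced(e, price) for e in choice.values())
            - sum(price[e] * CAP[e] for e in CAP))

def greedy_feasible(price):
    """Build a capacity-respecting allocation guided by the current prices."""
    left, total, alloc = dict(CAP), 0.0, {}
    for k in COMMODITIES:
        e = min((e for e in COST if left[e] > 0), key=lambda e: priced(e, price))
        left[e] -= 1
        alloc[k] = e
        total += COST[e]          # true cost, without prices
    return total, alloc

def subgradient(iterations=50):
    price = {e: 0.0 for e in COST}
    lower, upper = float("-inf"), float("inf")
    for t in range(1, iterations + 1):
        choice = relaxed_choice(price)
        lower = max(lower, lagrangian_value(choice, price))
        upper = min(upper, greedy_feasible(price)[0])
        for e in COST:
            flow = sum(1 for c in choice.values() if c == e)
            # Raise the price of over-subscribed edges, lower it otherwise,
            # and never let a price drop below zero.
            price[e] = max(0.0, price[e] + (flow - CAP[e]) / t)
    return lower, upper

lower, upper = subgradient()
print(lower, upper)
```

The lower bound climbs toward the feasible cost as the register's price approaches the memory cost. When the two meet, the feasible allocation is proven optimal.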

25
Simultaneous Allocator
Completed Work
Edges to/from memory cost 3
Current cost
-1
-3
-2
26
Trace-Based Allocation
Completed Work
  • Decompose function into traces of basic blocks
  • run simultaneous allocator on each trace
  • control flow internal to trace presents
    difficulty
  • addressed in proposal document

27
Evaluation
Completed Work
  • Implemented in gcc 3.4.4 targeting x86
  • Optimize for code size
  • perfect static evaluation
  • important metric in its own right
  • MediaBench, MiBench, Spec95, Spec2000
  • over 10,000 functions

28
Progressiveness
Completed Work
squareEncrypt
29
Progressiveness
Completed Work
quicksort
30
Code Size
Completed Work
31
Optimality
Completed Work
Proven optimality
32
Compile Time Slowdown :-(
Completed Work
9.2x slower
33
A Better Register Allocator
Completed Work
  • fully utilize machine description
  • explicit and expressive model of costs of
    allocation for given architecture: Global MCNF
  • locally optimal
  • approach optimality using progressive solution
    technique: Lagrangian directed allocators

34
Outline
  1. Motivation
  2. Related Work
  3. Completed Work
  4. Proposed Work
  5. Contributions and Timeline

35
A Better Better Register Allocator
Proposed Work
  • Solver Improvements
  • Improve initial solution
  • Improve quality as prices converge
  • Hope to prove approximation bounds
  • Model Improvements
  • Improve accuracy of model
  • Model simplification
  • Represent uniform register sets efficiently

36
Model Simplification
Proposed Work
Summarize overly expressive sections of the model
Conservative simplification: does not change
optimal value
Aggressive simplification: explore
tradeoff between model complexity and optimality
37
Instruction Selection Interaction
Proposed Work
  • which instruction is best
  • depends on the register allocator
  • so let register allocator decide

perform same operation
38
Register Allocation Aware Instruction SElection
(RA2ISE)
Proposed Work
  • Instruction selection not finalized until
    register allocation
  • IR tiled with Register Allocation Aware Tiles
    (RAATs)
  • A RAAT represents several instruction sequences
  • different costs
  • a sequence for every possible register allocation
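
One way to picture a RAAT is as a tile that carries several instruction sequences, each valid for a particular register assignment and each with its own cost. The sketch below is a hypothetical illustration, not the representation proposed in the talk: the `Variant` and `RAAT` classes, the `dst = a + b` tile, and the byte sizes (typical x86 encodings for general registers) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Alloc = Dict[str, str]  # operand name -> register name

@dataclass
class Variant:
    applies: Callable[[Alloc], bool]   # does this sequence fit the allocation?
    template: List[str]                # instruction templates using operand names
    size: int                          # illustrative size in bytes

@dataclass
class RAAT:
    name: str
    variants: List[Variant]

    def select(self, alloc: Alloc):
        """Pick the smallest applicable variant and fill in the registers."""
        fitting = [v for v in self.variants if v.applies(alloc)]
        best = min(fitting, key=lambda v: v.size)
        insns = [t.format(**alloc) for t in best.template]
        return best.size, insns

# Hypothetical tile for dst = a + b: a two-operand add is smaller, but it only
# works when the destination shares a register with one of the sources.
ADD_TILE = RAAT("dst = a + b", [
    Variant(lambda r: r["dst"] == r["a"], ["addl %{b}, %{dst}"], 2),
    Variant(lambda r: r["dst"] == r["b"], ["addl %{a}, %{dst}"], 2),
    Variant(lambda r: True, ["leal (%{a},%{b}), %{dst}"], 3),
])

same = ADD_TILE.select({"a": "eax", "b": "edx", "dst": "eax"})
diff = ADD_TILE.select({"a": "eax", "b": "edx", "dst": "ecx"})
print(same, diff)
```

Deferring the choice until registers are known lets the allocator weigh the one-byte difference against the cost of freeing up a particular register.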

39
RA2ISE
Proposed Work
RAAT
IR
tiling
model creation
register allocation
cwtl eax
40
Implementing RA2ISE
Proposed Work
  • Add side-constraints to Global MCNF model
  • implement inter-variable preferences and
    constraints
  • if x allocated to r1 and y allocated to r2, then
    save three bytes
  • x and y must be allocated to the same register
  • Implement x86 RAATs
  • RAAT tables created manually
  • GMCNF RAAT representation automatically generated
    from RAAT table with minimum use of side
    constraints
  • Algorithms for tiling RAATs
  • leverage existing algorithms
  • exploit feedback between passes

41
Tiling RAATs
Proposed Work
2
4
2
4
1
3
feedback
42
Evaluation
Proposed Work
  • Implement in production quality compiler (gcc)
  • Evaluate code size and simple code speed metric
  • Evaluate on three different architectures
  • x86 (8 registers)
  • 68k/ColdFire (16 registers)
  • PPC (32 registers)

43
Outline
  1. Motivation
  2. Related Work
  3. Completed Work
  4. Proposed Work
  5. Contributions and Timeline

44
Contributions
  • RA2ISE
  • register allocation aware tiles (RAATs)
    explicitly encode effect of register allocation
    on instruction sequence
  • algorithms for tiling RAATs
  • expressive model of register allocation that
    operates on RAATs and explicitly represents all
    important components of register allocation
  • progressive solver for this model that can
    quickly find decent solution and approaches
    optimality as more time is allowed for
    compilation
  • Comprehensive evaluation of RA2ISE

45
Thesis Statement
  • RA2ISE is a principled and effective system for
    performing instruction selection and register
    allocation.

46
One Step Towards a More Principled Compiler
copy prop
DCE
PRE
loop unroll
const prop
code motion
GVN
inline
strength reduct
CSE
SCCP
peep-hole
reg alloc
branch opt
machine description

optimized program
47
Timeline
Fall 2006: add simple speed metric option to model; begin model simplification work; improve model accuracy and solver performance
Winter 2006: finish model simplification work; add side-constraints to model; implement existing gcc tiles as RAATs; improve model accuracy and solver performance
Spring 2007: finish implementation of side-constraints and gcc RAATs; begin work on RA2ISE infrastructure; create gcc-independent set of RAATs for x86; improve model accuracy and solver performance
Summer 2007: finish work on RA2ISE; investigate and develop tiling algorithms; improve model accuracy and solver performance
Fall 2007: add 68k/ColdFire and PowerPC targets; investigate uniform register set simplifications; improve model accuracy and solver performance
Winter 2007: begin writing thesis; work on improving compile time performance
Spring 2008: finish writing thesis
48
Andrew Richard Koes
49
Questions?
50
Processor Performance
51
Instruction Selection and Register Allocation
  • fully utilize machine description
  • locally optimal
  • tight integration between phases

reg alloc
machine description
insn select
52
Costs of Register Allocation
  • Spilling to/from memory
  • movl 8(%ebp), %edx
  • Direct memory access
  • addl 8(%ebp), %eax
  • Moving between registers
  • movl %edx,%ecx
  • Rematerialization of constant value
  • movl $3,%eax
  • Register usage preferences
  • imul %edx,%eax
  • vs.
  • imul %edx,%ecx

53
Iterative Heuristic Allocator
  • Allocate each variable in a heuristic priority
    order
  • Find shortest path in each block
  • avoid edges that make remaining problem
    infeasible
  • Process blocks in topological order
  • allocation at block entry fixed by previous blocks

Intuition
  • shortest path is minimum cost allocation for a
    variable
  • allocate most significant variables first
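
The intuition that a shortest path is the minimum cost allocation for one variable can be sketched with a small dynamic program over a layered network. This is a simplified sketch, not the allocator from the talk: the function names, the register-to-register move cost of 1, and the example costs are assumptions, while the cost of 3 for edges to and from memory follows the example slides.

```python
def transition(src, dst):
    """Cost of moving a value between locations at consecutive program points."""
    if src == dst:
        return 0
    if "mem" in (src, dst):
        return 3        # load or store
    return 1            # register-to-register move (assumed cost)

def cheapest_allocation(points, locations, occupied, use_cost):
    """Shortest path through a layered network, one layer per program point.

    points    -- number of program points in the variable's live range
    locations -- candidate locations, e.g. ["r1", "mem"]
    occupied  -- set of (point, location) pairs taken by earlier variables
    use_cost  -- dict (point, location) -> extra cost of being there (default 0)
    Returns (cost, [location at each point]) or None if no path exists.
    """
    best = {loc: (use_cost.get((0, loc), 0), [loc])
            for loc in locations if (0, loc) not in occupied}
    for p in range(1, points):
        nxt = {}
        for loc in locations:
            if (p, loc) in occupied:
                continue
            options = [
                (cost + transition(prev, loc) + use_cost.get((p, loc), 0), path + [loc])
                for prev, (cost, path) in best.items()
            ]
            if options:
                nxt[loc] = min(options, key=lambda o: o[0])
        best = nxt
    return min(best.values(), key=lambda o: o[0]) if best else None

# Being in memory costs 3 at the definition (point 0) and at the use (point 2).
costs = {(0, "mem"): 3, (2, "mem"): 3}
free = cheapest_allocation(3, ["r1", "mem"], set(), costs)
spilled = cheapest_allocation(3, ["r1", "mem"], {(1, "r1")}, costs)
print(free, spilled)
```

When an earlier, higher-priority variable already holds `r1` at the middle point, the later variable has to route through memory and pay for it, which is why the allocation order matters.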

54
Iterative Heuristic Allocator
Edges to/from memory cost 3
Allocation order a, b, c, d
Cost
Total 2
55
Simultaneous Allocator
  • Scan each block
  • maintain an allocation of all live variables
  • at variable definition find cheapest allocation
  • allocation with shortest path to variable's sink
    or block exit
  • allowed to evict (reallocate) already allocated
    variable
  • eviction cost: shortest path to edge from current
    allocation to new allocation in this block
  • cost of eviction added to shortest path cost

Intuition
  • minimizing cost for all variables at once

56
Trace-Based Allocation
  • Decompose function into traces of basic blocks
  • run simultaneous allocator on each trace
  • control flow internal to trace
  • update only blocks that are necessary
    (easy-update)
  • update all affected blocks (full-update)

easy-update
full-update
57
Accuracy of the Model
Global MCNF model correctly predicts costs of
register allocation within 2% for 72.5% of
functions compiled
58
Compile Time Slowdown :-(
10x slower
59
Code size improvement
60
Code Size Improvement
61
Code Size Improvement
62
Code Performance
63
Integrating Register Allocation and Instruction
Selection
int foo(int a, short b) { return a*4 + b; }

4  movl 4(%esp),%eax
3  sall $2,%eax
4  addl 8(%esp),%eax
1  cwtl
1  ret

5  movswl 8(%esp),%edx
4  movl 4(%esp),%eax
3  leal (%edx,%eax,4),%eax
1  ret
64
Another RAAT