Run Time Optimization - PowerPoint PPT Presentation

Slides: 40

Provided by: arti97

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Run Time Optimization

15-745 Optimizing Compilers
Pedro Artigas

2
Motivation

A good reason
Compiling a language that contains run-time
constructs
Java dynamic class loading
Perl or Matlab eval(statement)
Faster than interpreting
A better reason
May use program information only available at run
time

3
Example of run-time information

The processor that will be used to run the
program
inc ax is faster on a Pentium III
add ax,1 is faster on a Pentium 4
No need to recompile if generating code at run
time
The actual program input/run-time behavior
Is my profile information accurate for the
current program input? YES!

4
The life cycle of a program
One object file: global analysis
One binary: whole-program analysis
One process: analysis? No, observation!
Larger scope, better information about program behavior
5
New strategies are possible

Pessimistic vs. optimistic approaches
Ex: Does int *a point to the same location as int *b?
Compile time/Pessimistic: Prove that in ANY execution those pointers point to different addresses
Run time/Optimistic: Up to now, in the current execution, a and b point to different locations
Assume this holds
If the assumption breaks, invalidate the generated code and generate new code

6
A sanity check

Using run-time information does not require run-time code generation
Example: Versioning
ISA may allow cheaper tests
IA-64
Transmeta

if (a != b) <generate code assuming a != b>
else <generate code assuming a == b>
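
Versioning needs no code generator at all: both versions exist ahead of time and a run-time test picks one. A small sketch, assuming a hypothetical kernel `shift_plus_one` that is not part of the slides, and using Python list identity as the aliasing test:

```python
def shift_plus_one(dst, src):
    """Compute dst[i] = src[i - 1] + 1 for i >= 1; dst and src may be the same list."""
    if dst is not src:
        # Version assuming dst != src: iterations are independent,
        # so the whole update can be done as one bulk operation.
        dst[1:] = [v + 1 for v in src[:-1]]
    else:
        # Version assuming dst == src: each iteration reads the value
        # written by the previous one, so the order must be preserved.
        for i in range(1, len(dst)):
            dst[i] = src[i - 1] + 1


out = [0, 0, 0, 0]
shift_plus_one(out, [1, 2, 3, 4])   # independent lists
same = [5, 0, 0, 0]
shift_plus_one(same, same)          # aliased lists
```

Running the bulk version on aliased lists would read stale values, which is why the test has to come first.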
7
Drawbacks

Code generation has to be FAST
Rule of thumb: almost linear in program size
Code quality: compromise on quality to achieve fast code generation
Shoot for good, not great
Also, this usually means
No time for classical Iterative Data Flow Analysis (IDFA) at run time

8
No classical IDFA Solutions

Quasi-Static and/or Staged Compilation
Perform IDFA at compile time
Specialize the dynamic code generator for the
obtained information
That is, encode the obtained data flow
information in the binary
Do not rely on classical IDFA
Use algorithms that do not require it
Ex: Dominator-based value numbering (coming up!)
Generate code in a style that does not require it
Ex: One-entry, multiple-exit traces, as in Deco and Dynamo

9
Code generation Strategies

Compiling a language that requires run-time code
generation
Compile adaptively
Use a very simple and fast code generation scheme
Re-compile frequently used regions using more
advanced techniques

10
Adaptive Compilation Motivation

Very simple code generation
Higher execution cost
Elaborate code generation
Higher compilation cost
Problem
We may not know in advance how frequently a
region will execute
Measure frequencies and re-compile dynamically

[Figure: cost of the fast compiler versus the optimizing compiler, with the threshold marked]
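
Measuring frequencies and re-compiling dynamically can be sketched with an execution counter per region. All names here (`AdaptiveRegion`, `compile_simple`, `compile_optimized`, the `threshold` value) are hypothetical, and the two "compilers" simply return Python functions of different quality:

```python
class AdaptiveRegion:
    """A code region that starts with cheap code and is re-compiled once it is hot."""

    def __init__(self, fast_compile, optimizing_compile, threshold):
        self.code = fast_compile()            # cheap to produce, slow to run
        self.optimizing_compile = optimizing_compile
        self.threshold = threshold
        self.executions = 0
        self.optimized = False

    def __call__(self, *args):
        self.executions += 1
        if not self.optimized and self.executions > self.threshold:
            # The region proved to be frequently executed: pay for better code.
            self.code = self.optimizing_compile()
            self.optimized = True
        return self.code(*args)


def compile_simple():
    def sum_below(n):                         # straightforward loop
        total = 0
        for i in range(n):
            total += i
        return total
    return sum_below


def compile_optimized():
    return lambda n: n * (n - 1) // 2         # closed form of the same sum


region = AdaptiveRegion(compile_simple, compile_optimized, threshold=3)
results = [region(10) for _ in range(5)]      # the fourth call triggers re-compilation
```

A real system would choose the threshold from the relative compilation and execution costs; the constant 3 is arbitrary.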
11
Code generation Strategies

Compiling selected regions that benefit from
run-time code generation
Pick only the regions that should benefit the
most
Which regions?
Select them statically
Use profile information
Re-compile (that is, select them dynamically)
Usually all of the above

12
Code Optimization Unit

What is the run-time unit of optimization?
Option: Procedures/static code regions
Similar to static compilers
Option: Traces
Start at the target of a backward branch
Include all the instructions in a path
May include procedure calls and returns
Branches
Fall-through: remain in the trace
Target: exit the trace
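
Trace selection can be sketched over a recorded execution path. This is a simplification under two assumptions that the slides do not state: block ids increase with code address (so a transfer to a lower or equal id is a backward branch), and the trace ends at the next backward branch. The function name `record_trace` is hypothetical, and side exits are not modeled.

```python
def record_trace(path):
    """Return the first trace in `path`, a list of basic-block ids in execution order."""
    trace = []
    for prev, cur in zip(path, path[1:]):
        backward = cur <= prev             # assumption: ids grow with address
        if not trace:
            if backward:
                trace.append(cur)          # trace head: target of a backward branch
        elif backward:
            break                          # assumption: next backward branch ends the trace
        else:
            trace.append(cur)              # keep following the executed path
    return trace


# Block 4 jumps back to block 1, so the trace starts at block 1.
hot_trace = record_trace([0, 1, 2, 4, 1, 2, 3, 4, 1])
```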

13
Current strategies
                               Static region            Trace
JIT compilers                  Java JITs, Matlab JITs   ?
Run-time performance engines   DyC, Fabius              Dynamo, Deco
14
Run-Time code generation: Case studies

Two examples of algorithms that are suitable for
run-time code generation
Run time CSE/PRE replacement
Dominator based value numbering
Run time Register Allocation
Linear scan register allocation

15
Sidebar

With traces, CSE/PRE become almost trivial
No need for register allocation if optimizing a binary (ex: Dynamo)

[Figure: PRE and CSE examples]
16
Review: Local value numbering

Store expressions already computed (in a hash table)
Store variable name → VN mapping in the VN array
Store VN → variable name mapping in the Name array
Same value number ⇒ same value

for each basic block
  Table.empty()
  for each computed expression (x = y op z)
    if V = Table.lookup(y op z)
      // Expression was computed in the past, check if result is available
      VN[x] = V
      if VN[Name[V]] == V   // expression is still there
        replace x = y op z with x = Name[V]
      else
        Name[V] = x
    else
      // New expression, add to the table
      VN[x] = new_value_number()
      Table.insert(y op z, VN[x])
      Name[VN[x]] = x
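
A runnable sketch of the local algorithm for a single basic block. The tuple encoding `(x, y, op, z)` for `x = y op z`, the function name, and the choice to key the table on the operands' value numbers are assumptions made for this example; commutativity and constants are not treated specially.

```python
from itertools import count


def local_value_numbering(block):
    """block: list of (x, y, op, z) tuples meaning x = y op z. Returns rewritten statements."""
    fresh = count()   # source of new value numbers
    table = {}        # (op, VN[y], VN[z]) -> value number of the expression
    vn = {}           # variable name -> value number
    name = {}         # value number -> variable that was assigned that value

    def number_of(var):
        if var not in vn:               # first use of a block input
            vn[var] = next(fresh)
        return vn[var]

    out = []
    for x, y, op, z in block:
        key = (op, number_of(y), number_of(z))
        v = table.get(key)
        if v is not None:               # computed in the past
            vn[x] = v
            holder = name[v]
            if vn[holder] == v:         # the result is still available in holder
                out.append(f"{x} = {holder}")
                continue
            name[v] = x                 # holder was overwritten; x holds it now
        else:                           # new expression
            v = next(fresh)
            vn[x] = v
            table[key] = v
            name[v] = x
        out.append(f"{x} = {y} {op} {z}")
    return out


rewritten = local_value_numbering([
    ("a", "b", "+", "c"),
    ("d", "b", "+", "c"),   # redundant: same operands, a still holds the value
    ("b", "a", "*", "d"),   # redefines b
    ("e", "b", "+", "c"),   # not redundant: b has a new value number
])
```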
17
Local value numbering

Works in linear time on program size
Assuming accesses to the array and the hash table
occur in constant time
Can we make it work in a scope larger than a basic block? (Hint: Yes)
What are the potential problems?

18
Problems

How to propagate the hash table contents across
basic blocks?
How to make sure that it is safe to access the location containing the expression in other basic blocks?
How do we make sure that the location containing the expression is fresh?
Remember: no IDFA

19
Control flow issues

On split points things are simple
Just keep the content of the hash table from the
predecessor
What about merge points?
We do not know if the same expression was
computed in all incoming paths
We do not want to check the fact anyway (why?)
Reset the state of the hash table to a safe state
it had in the past
Which program point in the past?
The immediate dominator of the merge block

20
Data flow issues

Making sure the def of an expression is fresh and reaches the blocks of interest
How?
By construction! SSA
All names are fresh (Single Assignment)
All defs dominate their uses (regular uses, not φ functions)
As, by construction, we introduce new defs using φ functions at every point where this would not hold

21
Dominator/SSA based value numbering

DVN(Block B)
  Table.PushScope()
  // First process the φ expressions
  for each exp n = φ(...)
    if (exp is redundant or meaningless)   // meaningless: φ(x0, x0)
      VN[n] = Table.lookup(φ(...)) or x0
      remove(n = φ(...))
    else
      VN[n] = n
      Table.insert(φ(...), VN[n])
  // Then the regular ones
  for each exp x = y op z
    if (v = Table.lookup(y op z))
      VN[x] = v
      remove(x = y op z)
    else
      VN[x] = x
      Table.insert(y op z, VN[x])
  // Propagate info about φ inputs and call DVN recursively
  for each successor s of B
    Adjust the φ inputs
  for each dominator tree child c in CFG reverse post-order
    DVN(c)
  Table.PopScope()
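
A compact sketch of the recursive walk, assuming the program is already in SSA form and the dominator tree is given. The data layout is hypothetical: `cfg` maps a block name to its `phis` (pairs of a name and a dict from predecessor to input), its `ops` (tuples `(x, y, op, z)`), and its `succs`; `dom_children` lists each block's dominator-tree children in reverse post-order. The table is a stack of dicts, one per scope.

```python
def dvn(block, cfg, dom_children, vn, scopes):
    """Dominator-based value numbering; rewrites cfg in place and fills vn."""
    scopes.append({})                              # Table.PushScope()

    def lookup(key):
        for scope in reversed(scopes):             # innermost scope first
            if key in scope:
                return scope[key]
        return None

    node = cfg[block]
    kept_phis = []
    for n, args in node["phis"]:                   # first the phi expressions
        key = ("phi", block, tuple(sorted(args.items())))
        inputs = set(args.values())
        if len(inputs) == 1:                       # meaningless: phi(x0, x0)
            vn[n] = inputs.pop()
        elif lookup(key) is not None:              # redundant: same phi seen before
            vn[n] = lookup(key)
        else:
            vn[n] = n
            scopes[-1][key] = n
            kept_phis.append((n, args))
    node["phis"] = kept_phis

    kept_ops = []
    for x, y, op, z in node["ops"]:                # then the regular expressions
        y, z = vn.get(y, y), vn.get(z, z)          # rewrite operands to their value names
        v = lookup((op, y, z))
        if v is not None:
            vn[x] = v                              # redundant: drop the statement
        else:
            vn[x] = x
            scopes[-1][(op, y, z)] = x
            kept_ops.append((x, y, op, z))
    node["ops"] = kept_ops

    for s in node["succs"]:                        # adjust the phi inputs
        for _, args in cfg[s]["phis"]:
            if block in args:
                args[block] = vn.get(args[block], args[block])
    for child in dom_children.get(block, []):
        dvn(child, cfg, dom_children, vn, scopes)
    scopes.pop()                                   # Table.PopScope()


# Diamond CFG: B0 branches to B1 and B2, which merge in B3.
cfg = {
    "B0": {"phis": [], "ops": [("x0", "a0", "+", "b0")], "succs": ["B1", "B2"]},
    "B1": {"phis": [], "ops": [("x1", "a0", "+", "b0")], "succs": ["B3"]},
    "B2": {"phis": [], "ops": [("x2", "a0", "+", "b0")], "succs": ["B3"]},
    "B3": {"phis": [("x3", {"B1": "x1", "B2": "x2"})],
           "ops": [("z0", "x3", "+", "c0")], "succs": []},
}
value_names = {}
dvn("B0", cfg, {"B0": ["B1", "B2", "B3"]}, value_names, [])
```

In this example both recomputations of `a0 + b0` are removed because `B0` dominates them, and the φ in `B3` becomes meaningless once its inputs are adjusted to `x0`.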
22
Example
[Figure: worked example with a Name/VN table over the SSA names u0, v0, w0, x0, y0, u1, x1, y1, u2, x2, y2, u3]
23
Problems

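Does not catch the redundancies shown in the figure
But it performs almost as well as CSE
And runs much faster
Linear time? (YES? NO?)

[Figure: CFG examples of missed redundancies; surviving labels: x0 = a0 op b0, x1 = a0 op b0; x0 = a0 op b0, x1 = φ(x0, x2), x2 = a0 op b0, x2 = φ(x0, x1)]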
24
Homework 4

The DVN algorithm scans the CFG in a similar way as the second phase of SSA translation
SSA translation phase 1
Placing φ functions
SSA translation phase 2
Assigning unique numbers to variables
Combine both and save one pass
Gives us a smaller constant
But, at run time, it pays off!

25
Run time register allocation

Graph Coloring? Not an option
Even the simple stack-based heuristic shown in class is O(n^2)
Not even counting
Building the graph
Move coalescing optimization
But register allocation is VERY important in
terms of performance
Remember, memory is REALLY slow
We need a simple but effective (almost) linear
time algorithm

26
Let's start simple

Start with a local (basic block) linear time
algorithm
Assuming only one def and one use per variable
(More constrained than SSA)
Assuming that if a variable is spilled it must
remain spilled (Why?)
Can we find an optimum linear time algorithm? (Hint: Yes)
Ideas?
Think about liveness first

27
Simple Algorithm: Computing Liveness

One def and one use per variable, only one block
A live range is merely the interval between the def and the use
Live Interval: interval between the first def and the last use
OBS: Live Range = Live Interval if there is no control flow and only one def and one use
We could compute live intervals using a linear scan if we store the def instructions (beginning of the interval) in a hash table

28
Example

S1: A = 1
S2: B = 2
S3: C = 3
S4: D = A
S5: E = B
S6: use(E)
S7: use(D)
S8: use(C)
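
The live intervals of this example can be computed in one forward pass. A minimal sketch, assuming a hypothetical encoding of each statement as a pair `(defined variable or None, list of used variables)` and statement positions counted from 1:

```python
def live_intervals(instructions):
    """Return {variable: [start, end]} for a single basic block."""
    intervals = {}                                   # hash table keyed by variable
    for pos, (dest, uses) in enumerate(instructions, start=1):
        for var in uses:
            # Extend the interval to this use; a variable used before any
            # def in the block (a block input) starts at its first use.
            intervals.setdefault(var, [pos, pos])[1] = pos
        if dest is not None and dest not in intervals:
            intervals[dest] = [pos, pos]             # interval begins at the def
    return intervals


program = [
    ("A", []), ("B", []), ("C", []),     # S1..S3: A = 1, B = 2, C = 3
    ("D", ["A"]), ("E", ["B"]),          # S4, S5: D = A, E = B
    (None, ["E"]), (None, ["D"]), (None, ["C"]),   # S6..S8: use(E), use(D), use(C)
]
intervals = live_intervals(program)
```

Each statement is visited once and each hash-table access is assumed to take constant time, so the pass is linear in the number of statements.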

29
Now Register Allocation

Another linear scan
Keep the active intervals in a list (active)
Assumption: an interval, when spilled, will remain spilled
Two scenarios
1. A register is free: no problem
2. No register is free: must spill
Which interval?

30
Spilling heuristic

Since there is no second chance
That is, a spilled variable will always remain spilled
Spill the interval that ends last
Intuition: as one spill must occur, pick the one that makes the remaining allocation least constrained
That is, the interval that ends last
This is the provably optimum solution (given all the constraints)

31
Linear Scan Register Allocation

active = {}
freeregs = {all_registers}
for each interval I (in order of increasing start point)
  // Expire old intervals
  for each interval J in active
    if J.end > I.start
      continue
    active.remove(J)
    freeregs.insert(J.register)
  end for each interval J
  if active.length() == R
    // Must spill, pick either the last interval in active or the new interval
    spill_candidate = active.last()
    if (spill_candidate.end > I.end)
      I.register = spill_candidate.register
      spill(spill_candidate)
      active.remove(spill_candidate)
      active.insert_sorted(I)   // sorted by end point
    else
      spill(I)
  else
    // No constraints
    I.register = freeregs.remove_any()
    active.insert_sorted(I)   // sorted by end point
32
Example (R = 2)

S1: A = 1
S2: B = 2
S3: C = 3
S4: D = A
S5: E = B
S6: use(E)
S7: use(D)
S8: use(C)

[Figure: live intervals of A, B, C, D, E drawn against statements S1 to S8]
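
The allocation pass can be tried on this example with two registers. The sketch below is an illustration, not the lecture's implementation: the `Interval` class, the register names `r0` and `r1`, and the use of `None` to mark a spilled interval are assumptions, and the active list is re-sorted after each step instead of using a sorted insert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interval:
    name: str
    start: int
    end: int
    register: Optional[str] = None      # None: spilled (or not yet allocated)


def linear_scan(intervals, registers):
    """Assign registers to live intervals; returns {name: register or None}."""
    free = list(registers)
    active = []                          # intervals holding a register, sorted by end
    for cur in sorted(intervals, key=lambda iv: iv.start):
        for old in list(active):         # expire old intervals
            if old.end > cur.start:
                continue
            active.remove(old)
            free.append(old.register)
        if len(active) == len(registers):
            victim = active[-1]          # the active interval that ends last
            if victim.end > cur.end:
                cur.register = victim.register
                victim.register = None   # spill the victim; it stays spilled
                active.remove(victim)
                active.append(cur)
            # otherwise cur itself is spilled and keeps register None
        else:
            cur.register = free.pop(0)   # no constraints: take a free register
            active.append(cur)
        active.sort(key=lambda iv: iv.end)
    return {iv.name: iv.register for iv in intervals}


example = [Interval("A", 1, 4), Interval("B", 2, 5), Interval("C", 3, 8),
           Interval("D", 4, 7), Interval("E", 5, 6)]
allocation = linear_scan(example, ["r0", "r1"])
```

Here C is the only spill: when it starts, both registers are held by A and B, and C ends later than either of them. D then reuses A's register and E reuses B's.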
33
Is the second pass really linear?

Invariant: active.length() <= R
Complexity: O(R n)
R is usually a small constant (128 at most)
Therefore: O(n)

34
And we are done! Right?

YES and NO
Use the same algorithm as before for register
assignment
Program representation: linear list of instructions
Live intervals are not precise anymore, given control flow and multiple defs/uses
Not optimum, but still FAST
Code quality within 10% of graph coloring for SPEC95 benchmarks (one problem with this claim)

35
The worst problem: Obtaining precise live intervals

How to obtain precise live interval information FAST?
The 10% claim relies on live interval information obtained using liveness analysis (IDFA)
IDFA is SLOW, O(n^3)
Most recent solutions
Use the local interval algorithm for variables that are live only inside one basic block
Use liveness analysis for more global variables
Alleviates the problem, does not fully solve it

36
More problems: Live intervals may not be precise
OBS: The idea of lifetime holes leads to allocators that also try to use these holes to assign the same register to other live ranges (bin-packing). Such an allocator is used in the Alpha family of compilers (GEM compilers)
37
Other problems: Linearization order

Register allocation quality depends on the chosen block linearization order
Choose a good order in practice
Layout order
Depth-first traversal of the CFG
Both only 10% slower than graph coloring

38
Graph coloring versus Linear scan
Compilation cost scaling
39
Conclusion

Run time code generation provides new
optimization opportunities
Challenges
Identify new optimization opportunities
Design new compilation strategies
Example: optimistic versus conservative
Design algorithms and implementations that
minimize run time overhead
Do not compromise much on code quality
Recent examples indicate
extending fast local methods is a promising way
to obtain fast run-time code generation