Run Time Optimization - PowerPoint PPT Presentation

1
Run Time Optimization
  • 15-745 Optimizing Compilers
  • Pedro Artigas

2
Motivation
  • A good reason
  • Compiling a language that contains run-time
    constructs
  • Java dynamic class loading
  • Perl or Matlab eval(statement)
  • Faster than interpreting
  • A better reason
  • May use program information only available at run
    time

3
Example of run-time information
  • The processor that will be used to run the
    program
  • inc ax is faster on a Pentium III
  • add ax,1 is faster on a Pentium 4
  • No need to recompile if generating code at run
    time
  • The actual program input/run-time behavior
  • Is my profile information accurate for the
    current program input? YES!

4
The life cycle of a program
One object file: Global analysis
One binary: Whole-program analysis
One process: Analysis? No, observation!
Larger scope, better information about program behavior
5
New strategies are possible
  • Pessimistic vs. optimistic approaches
  • Ex: Does int *a point to the same location as int *b?
  • Compile time/pessimistic: Prove that in ANY execution those pointers point to different addresses
  • Run time/optimistic: Up to now, in the current execution, a and b have pointed to different locations
  • Assume this holds
  • If the assumption breaks, invalidate generated
    code and generate new code

6
A sanity check
  • Using run time information does not require run
    time code generation
  • Example: Versioning
  • ISA may allow cheaper tests
  • IA-64
  • Transmeta

if (a != b)
  <generate code assuming a != b>
else
  <generate code assuming a == b>
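To make the versioning idea concrete, here is a minimal Python sketch (not from the slides): one cheap run-time aliasing test selects between a version specialized for the no-aliasing case and a safe fallback; the function name and buffers are hypothetical.

  def shift_right(a, b):
      # Compute a[i+1] = b[i]; the result differs if a and b alias.
      if a is not b:                      # version assuming a != b (no aliasing)
          for i in range(len(a) - 1):
              a[i + 1] = b[i]
      else:                               # version assuming a == b (aliasing possible)
          snapshot = list(b)              # copy first so the shift stays correct
          for i in range(len(a) - 1):
              a[i + 1] = snapshot[i]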
7
Drawbacks
  • Code generation has to be FAST
  • Rule of thumb: almost linear in program size
  • Code quality: compromise on quality to achieve fast code generation
  • Shoot for good, not great
  • Also, this usually means:
  • No time for classical Iterative Data Flow Analysis (IDFA) at run time

8
No classical IDFA: Solutions
  • Quasi-Static and/or Staged Compilation
  • Perform IDFA at compile time
  • Specialize the dynamic code generator for the
    obtained information
  • That is, encode the obtained data flow
    information in the binary
  • Do not rely on classical IDFA
  • Use algorithms that do not require it
  • Ex: Dominator-based value numbering (coming up!)
  • Generate code in a style that does not require it
  • Ex: One-entry, multiple-exit traces
  • as in Deco and Dynamo

9
Code generation Strategies
  • Compiling a language that requires run-time code
    generation
  • Compile adaptively
  • Use a very simple and fast code generation scheme
  • Re-compile frequently used regions using more
    advanced techniques

10
Adaptive Compilation: Motivation
  • Very simple code generation
  • Higher execution cost
  • Elaborate code generation
  • Higher compilation cost
  • Problem
  • We may not know in advance how frequently a
    region will execute
  • Measure frequencies and re-compile dynamically

[Figure: total cost versus number of executions for the fast compiler and the optimizing compiler; beyond a cost threshold the optimizing compiler wins.]
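A minimal Python sketch of the measure-and-re-compile loop (my illustration, not from the slides): count executions of a region and switch to an optimized version once a hypothetical hot threshold is crossed; optimize stands for whatever the optimizing code generator is.

  HOT_THRESHOLD = 1000   # hypothetical cost threshold

  def adaptive(fast_version, optimize):
      # Dispatch to the cheaply generated code until the region becomes hot,
      # then re-compile it with the more advanced code generator.
      state = {"count": 0, "optimized": None}
      def dispatch(*args):
          if state["optimized"] is not None:
              return state["optimized"](*args)
          state["count"] += 1
          if state["count"] >= HOT_THRESHOLD:
              state["optimized"] = optimize(fast_version)
          return fast_version(*args)
      return dispatch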
11
Code generation Strategies
  • Compiling selected regions that benefit from
    run-time code generation
  • Pick only the regions that should benefit the
    most
  • Which regions?
  • Select them statically
  • Use profile information
  • Re-compile (that is, select them dynamically)
  • Usually all of the above

12
Code Optimization Unit
  • What is the run-time unit of optimization?
  • Option: Procedures/static code regions
  • Similar to static compilers
  • Option: Traces
  • Start at the target of a backward branch
  • Include all the instructions in a path
  • May include procedure calls and returns
  • Branches
  • Fall-through: remain in the trace
  • Target (taken): exit the trace (a sketch of this rule follows the figure below)

[Figure: example CFG with basic blocks numbered 1-4 illustrating trace selection.]
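A minimal Python sketch of the trace-selection rule above (my illustration, not from the slides); cfg, head and the (fallthrough, target) block encoding are assumptions.

  def build_trace(cfg, head):
      # Start at the target of a backward branch (head) and follow
      # fall-through successors; taken branch targets become trace exits.
      trace, seen = [], set()
      b = head
      while b is not None and b not in seen:
          seen.add(b)
          trace.append(b)
          fallthrough, _target = cfg[b]   # _target would be a side exit
          b = fallthrough
      return trace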
13
Current strategies
                               Static region              Trace
JIT compilers                  Java JITs, Matlab JITs     ?
Run-time performance engines   DyC, Fabius                Dynamo, Deco
14
Run-Time Code Generation: Case Studies
  • Two examples of algorithms that are suitable for
    run-time code generation
  • Run time CSE/PRE replacement
  • Dominator based value numbering
  • Run time Register Allocation
  • Linear scan register allocation

15
Sidebar
  • With traces, CSE/PRE becomes almost trivial
  • No need for register allocation if optimizing a binary (e.g., Dynamo)

[Figure: trace examples illustrating PRE and CSE.]
16
Review Local value numbering
  • Store expressions already computed (in a hash
    table)
  • Store the variable name → VN mapping in the VN array
  • Store the VN → variable name mapping in the Name array
  • Same value number → same value
  for each basic block
    Table.empty()
    for each computed expression (x = y op z)
      if (V = Table.lookup(y op z))
        VN[x] = V
        if VN[Name[V]] == V        // expression result is still there
          replace x = y op z with x = Name[V]
        else
          Name[V] = x
      else
        VN[x] = new_value_number()
        Table.insert(y op z, VN[x])
        Name[VN[x]] = x

Expression was computed in the past, check if
result is available
New expression, add to the table
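The loop above translates almost directly into Python; this is my own sketch (not from the slides), with instructions as hypothetical (dst, op, src1, src2) tuples and no handling of commutativity.

  from itertools import count

  def local_value_numbering(block):
      # Rewrite redundant computations in one basic block as copies.
      fresh = count()
      table = {}     # (op, VN[y], VN[z]) -> value number
      VN = {}        # variable name -> value number
      Name = {}      # value number -> variable currently holding that value
      def vn_of(v):
          if v not in VN:
              VN[v] = next(fresh)
          return VN[v]
      out = []
      for x, op, y, z in block:
          key = (op, vn_of(y), vn_of(z))
          if key in table:                    # expression was computed in the past
              V = table[key]
              VN[x] = V
              if VN.get(Name[V]) == V:        # result is still available: reuse it
                  out.append((x, "copy", Name[V], None))
                  continue
              Name[V] = x                     # recompute, remember the new holder
          else:                               # new expression, add to the table
              V = next(fresh)
              VN[x] = V
              table[key] = V
              Name[V] = x
          out.append((x, op, y, z))
      return out

For example, on [("a", "+", "x", "y"), ("b", "+", "x", "y")] the second instruction becomes a copy of a.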
17
Local value numbering
  • Works in linear time on program size
  • Assuming accesses to the array and the hash table
    occur in constant time
  • Can we make it work in a scope larger than a basic block? (Hint: Yes)
  • What are the potential problems?

18
Problems
  • How to propagate the hash table contents across
    basic blocks?
  • How to make sure that it is safe to access the location containing the expression in other basic blocks?
  • How do we make sure the location containing the expression is fresh?
  • Remember: no IDFA

19
Control flow issues
  • At split points things are simple
  • Just keep the content of the hash table from the
    predecessor
  • What about merge points?
  • We do not know if the same expression was
    computed in all incoming paths
  • We do not want to check the fact anyway (why?)
  • Reset the state of the hash table to a safe state
    it had in the past
  • Which program point in the past?
  • The immediate dominator of the merge block

20
Data flow issues
  • Making sure the def of an expression is fresh and
    reaches the blocks of interest
  • How?
  • By construction! SSA
  • All names are fresh (Single Assignment)
  • All defs dominate their uses (regular uses, not φ functions)
  • Because, by construction, we introduce new defs using φ functions at every point where this would not hold

21
Dominator/SSA based value numbering
  DVN(Block B)
    Table.PushScope()
    for each exp n = φ(...)
      if (exp is redundant or meaningless)     // meaningless: φ(x0, x0)
        VN[n] = Table.lookup(φ(...)) or x0
        remove(n = φ(...))
      else
        VN[n] = n
        Table.insert(φ(...), VN[n])
    for each exp x = y op z
      if (v = Table.lookup(y op z))
        VN[x] = v
        remove(x = y op z)
      else
        VN[x] = x
        Table.insert(y op z, VN[x])
    for each successor s of B
      adjust the φ inputs
    for each dominator tree child c in CFG reverse post-order
      DVN(c)
    Table.PopScope()

First process the φ expressions, then the regular ones; finally propagate info about φ inputs and call DVN recursively on the dominator-tree children.
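The reset "to the state at the immediate dominator" from slide 19 is exactly what Table.PushScope()/PopScope() provide when DVN walks the dominator tree. A small Python sketch of such a scoped hash table (my illustration, not from the slides):

  class ScopedTable:
      # Entries inserted in a child scope disappear when that scope is popped,
      # so a block only sees entries inserted by its dominators.
      def __init__(self):
          self.scopes = [{}]
      def push_scope(self):
          self.scopes.append({})
      def pop_scope(self):
          self.scopes.pop()
      def insert(self, key, value):
          self.scopes[-1][key] = value
      def lookup(self, key):
          for scope in reversed(self.scopes):
              if key in scope:
                  return scope[key]
          return None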
22
Example
[Figure: worked DVN example on a small SSA-form CFG with variables u0, v0, w0, x0, y0, u1, x1, y1, u2, x2, y2 and u3, showing the VN and Name tables as blocks are visited.]
23
Problems
  • Does not catch the redundancies illustrated below
  • But it performs almost as well as CSE
  • And runs much faster
  • linear time? (YES? NO?)

[Figure: CFG fragments with computations x0 = a0 op b0, x1 = a0 op b0, x2 = a0 op b0 and φ nodes x1 = φ(x0, x2) and x2 = φ(x0, x1); the redundant computation is missed because no equivalent computation dominates it.]
24
Homework 4
  • The DVN algorithm scans the CFG in a similar way to the second phase of SSA translation
  • SSA translation phase 1
  • Placing ? functions
  • SSA translation phase 2
  • assigning unique numbers to variables
  • Combine both and save one pass
  • Gives us a smaller constant
  • But, at run time, it pays off!

25
Run time register allocation
  • Graph Coloring? Not an option
  • Even the simple stack-based heuristic shown in class is O(n²)
  • Not even counting
  • Building the graph
  • Move coalescing optimization
  • But register allocation is VERY important in
    terms of performance
  • Remember, memory is REALLY slow
  • We need a simple but effective (almost) linear
    time algorithm

26
Let's start simple
  • Start with a local (basic block) linear time
    algorithm
  • Assuming only one def and one use per variable
    (More constrained than SSA)
  • Assuming that if a variable is spilled it must
    remain spilled (Why?)
  • Can we find an optimum linear time algorithm? (Hint: Yes)
  • Ideas?
  • Think about liveness first

27
Simple Algorithm: Computing Liveness
  • One def and one use per variable, only one block
  • A live range is merely the interval between the
    def and the use
  • Live interval: interval between the first def and the last use
  • OBS: Live range = live interval if there is no control flow and only one def and use
  • We could compute live intervals using a linear
    scan if we store the def instructions (beginning
    of the interval) in a hash table
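A small Python sketch of that single linear pass (my illustration, not from the slides): each instruction is a hypothetical (defs, uses) pair of variable-name sets, and the result maps each variable to its (start, end) instruction indices.

  def live_intervals(block):
      # One scan: remember the first def and the last use of each variable.
      start, end = {}, {}
      for i, (defs, uses) in enumerate(block, 1):
          for v in uses:
              start.setdefault(v, i)   # live-in variables start where first seen
              end[v] = i               # extend the interval to the latest use
          for v in defs:
              start.setdefault(v, i)   # first def opens the interval
              end.setdefault(v, i)
      return {v: (start[v], end[v]) for v in start}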

28
Example
  • S1: A = 1
  • S2: B = 2
  • S3: C = 3
  • S4: D = A
  • S5: E = B
  • S6: use(E)
  • S7: use(D)
  • S8: use(C)
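Running the live_intervals sketch from the previous slide on this block (with use(x) modeled as a use and the constants as def-only instructions) gives the expected intervals:

  block = [({"A"}, set()), ({"B"}, set()), ({"C"}, set()),
           ({"D"}, {"A"}), ({"E"}, {"B"}),
           (set(), {"E"}), (set(), {"D"}), (set(), {"C"})]
  print(live_intervals(block))
  # {'A': (1, 4), 'B': (2, 5), 'C': (3, 8), 'D': (4, 7), 'E': (5, 6)}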

29
Now: Register Allocation
  • Another linear scan
  • Keep the active intervals in a list (active)
  • Assumption: an interval, when spilled, will remain spilled
  • Two scenarios:
  • 1. A register is still free: no problem
  • 2. All R registers are in use: must spill
  • Which interval?

30
Spilling heuristic
  • Since there is no second chance
  • That is, a spilled variable will always remain spilled
  • Spill the interval that ends last
  • Intuition: as one spill must occur
  • Pick the one that makes the remaining allocation
    least constrained
  • That is, the interval that ends last
  • This is the provably optimum solution (given all
    the constraints)

31
Linear Scan Register Allocation
  active = {}
  freeregs = all_registers
  for each interval I (in order of increasing start point)
    for each interval J in active          // expire old intervals
      if J.end > I.start
        continue
      active.remove(J)
      freeregs.insert(J.register)
    end for each interval J
    if active.length() == R                // must spill: the last interval in active or the new one
      spill_candidate = active.last()
      if (spill_candidate.end > I.end)
        I.register = spill_candidate.register
        spill(spill_candidate)
        active.remove(spill_candidate)
        active.insert_sorted(I)            // sorted by end point
      else
        spill(I)
    else                                   // no constraints: take a free register
      I.register = freeregs.remove_any()
      active.insert_sorted(I)
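A runnable Python transcription of the pseudocode above (mine, not from the slides); intervals are hypothetical (name, start, end) tuples and R is the number of registers.

  def linear_scan(intervals, R):
      # Returns ({name: register}, {names of spilled intervals}).
      freeregs = list(range(R))
      active = []                                  # active intervals, kept sorted by end point
      regs, spilled = {}, set()
      for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
          for j in list(active):                   # expire intervals that already ended
              if j[2] <= start:
                  active.remove(j)
                  freeregs.append(regs[j[0]])
          if len(active) == R:                     # must spill: last active interval or the new one
              cand = active[-1]
              if cand[2] > end:
                  regs[name] = regs.pop(cand[0])   # steal its register, spill the candidate
                  spilled.add(cand[0])
                  active.remove(cand)
                  active.append((name, start, end))
                  active.sort(key=lambda iv: iv[2])
              else:
                  spilled.add(name)                # the new interval ends last: spill it
          else:                                    # a register is free
              regs[name] = freeregs.pop()
              active.append((name, start, end))
              active.sort(key=lambda iv: iv[2])
      return regs, spilled

On the interval set from the next slide (A, B, C, D, E with R = 2), this sketch keeps A, B, D and E in registers and spills C, as the "spill the interval that ends last" heuristic predicts.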
32
Example (R = 2)
  • S1: A = 1
  • S2: B = 2
  • S3: C = 3
  • S4: D = A
  • S5: E = B
  • S6: use(E)
  • S7: use(D)
  • S8: use(C)

[Figure: live intervals of A, B, C, D and E plotted over S1-S8, showing the linear scan allocation with two registers.]
33
Is the second pass really linear?
  • Invariant: active.length() <= R
  • Complexity: O(R·n)
  • R is usually a small constant (128 at most)
  • Therefore O(n)

34
And we are done! Right?
  • YES and NO
  • Use the same algorithm as before for register
    assignment
  • Program representation: linear list of instructions
  • Live intervals are not precise anymore given
    control flow and multiple def/uses
  • Not optimum, but still FAST
  • Code quality within 10% of graph coloring for SPEC95 benchmarks (one problem with this claim)

35
The worst problem: Obtaining precise live intervals
  • How to obtain precise live interval information
    FAST?
  • Claim of 10% relies on live interval information obtained using liveness analysis (IDFA)
  • IDFA is SLOW, O(n³)
  • Most recent solutions
  • Use the local interval algorithm for variables
    that only live inside one basic block
  • Use liveness analysis for more global variables
  • Alleviates the problem, does not fully solve it

36
More problems Live intervals may not be precise
OBS The idea of lifetime holes leads to
allocators that also try to use this holes to
assign the same register to other live ranges
(bin-packing) Such an allocator is used in the
Alpha family of compilers (GEM compilers)
37
Other problems: Linearization order
  • Register allocation quality depends on chosen
    block linearization order
  • Choose a good order in practice
  • layout order
  • depth first traversal of the CFG
  • Both are only about 10% slower than graph coloring
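For illustration (not from the slides), a depth-first traversal of the CFG in Python, assuming cfg maps each block to its list of successors:

  def dfs_linearization(cfg, entry):
      # Depth-first block order, one of the practical linearization orders above.
      order, seen = [], set()
      def visit(b):
          if b in seen:
              return
          seen.add(b)
          order.append(b)
          for succ in cfg.get(b, []):
              visit(succ)
      visit(entry)
      return order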

38
Graph coloring versus Linear scan
[Figure: compilation cost scaling of graph coloring versus linear scan.]
39
Conclusion
  • Run time code generation provides new
    optimization opportunities
  • Challenges
  • Identify new optimization opportunities
  • Design new compilation strategies
  • example: optimistic versus conservative
  • Design algorithms and implementations that:
  • Minimize run-time overhead
  • Do not compromise much on code quality
  • Recent examples indicate that
  • extending fast local methods is a promising way to obtain fast run-time code generation