Title: Exhaustive Phase Order Search Space Exploration and Evaluation
1. Exhaustive Phase Order Search Space Exploration and Evaluation
by Prasad Kulkarni (Florida State University)
2. Compiler Optimizations
- To improve the efficiency of compiler-generated code
- Optimization phases require enabling conditions
- need specific patterns in the code
- many also need available registers
- Phases interact with each other
- Applying optimizations in different orders generates different code
3. Phase Ordering Problem
- To find an ordering of optimization phases that produces optimal code with respect to possible phase orderings
- Evaluating each sequence involves compiling, assembling, linking, executing, and verifying results
- The best optimization phase ordering depends on
- the source application
- the target platform
- the implementation of the optimization phases
- A long-standing problem in compiler optimization!
4. Phase Ordering Space
- Current compilers incorporate numerous different optimization phases
- 15 distinct phases in our compiler backend
- 15! = 1,307,674,368,000 orderings if each phase is applied exactly once
- Phases can enable each other
- any phase can be active multiple times
- 15^15 = 437,893,890,380,859,375 sequences of length 15
- cannot restrict sequence length to 15
- 15^44 ≈ 5.598 × 10^51
5. Addressing Phase Ordering
- Exhaustive search
- universally considered intractable
- We are now able to exhaustively evaluate the optimization phase order space.
6. Re-stating of Phase Ordering
- Earlier approach
- explicitly enumerate all possible optimization phase orderings
- Our approach
- explicitly enumerate all function instances that can be produced by any combination of phases
7. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
9. Experimental Framework
- We used the VPO compilation system
- an established compiler framework, in development since 1988
- performance comparable to gcc -O2
- VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order
- Experiments use all 15 re-orderable optimization phases in VPO
- Target architecture was the StrongARM SA-100 processor
10. VPO Optimization Phases

ID  Optimization Phase              ID  Optimization Phase
b   branch chaining                 l   loop transformations
c   common subexpression elim.      n   code abstraction
d   remove unreachable code         o   evaluation order determination
g   loop unrolling                  q   strength reduction
h   dead assignment elim.           r   reverse branches
i   block reordering                s   instruction selection
j   minimize loop jumps             u   remove useless jumps
k   register allocation
11. Disclaimers
- Did not include optimization phases normally associated with compiler front ends
- no memory hierarchy optimizations
- no inlining or other interprocedural optimizations
- Did not vary how phases are applied
- Did not include optimizations that require profile data
12. Benchmarks
- 12 MiBench benchmarks (244 functions)

Category   Program       Description
auto       bitcount      test processor bit manipulation abilities
auto       qsort         sort strings using the quicksort sorting algorithm
network    dijkstra      Dijkstra's shortest path algorithm
network    patricia      construct Patricia trie for IP traffic
telecomm   fft           fast Fourier transform
telecomm   adpcm         compress 16-bit linear PCM samples to 4-bit
consumer   jpeg          image compression and decompression
consumer   tiff2bw       convert color .tiff image to black and white
security   sha           secure hash algorithm
security   blowfish      symmetric block cipher with variable-length key
office     stringsearch  search for given words in phrases
office     ispell        fast spelling checker
13. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
14. Terminology
- Active phase: an optimization phase that modifies the function representation
- Dormant phase: a phase that is unable to find any opportunity to change the function
- Function instance: any semantically, syntactically, and functionally correct representation of the source function (that can be produced by our compiler)
15. Naïve Optimization Phase Order Space
- All combinations of optimization phase sequences are attempted
[Figure: tree rooted at the unoptimized function (level L0); each node at levels L1 and L2 branches on all four phases a, b, c, d]
16. Eliminating Consecutively Applied Phases
- A phase just applied in our compiler cannot be immediately active again
[Figure: the same tree with each node's branch for its own phase removed]
17. Eliminating Dormant Phases
- Get feedback from the compiler indicating whether any transformations were successfully applied in a phase
[Figure: the tree further pruned by removing branches for dormant phases]
18. Identical Function Instances
- Some optimization phases are independent
- example: branch chaining and register allocation
- Different phase sequences can produce the same code, e.g. starting from
    r[2] = 1;
    r[3] = r[4] + r[2];
- instruction selection alone yields
    r[3] = r[4] + 1;
- constant propagation yields
    r[2] = 1;
    r[3] = r[4] + 1;
  and dead assignment elimination then yields
    r[3] = r[4] + 1;
19. Equivalent Function Instances
Source code:
    sum = 0;
    for (i = 0; i < 1000; i++)
        sum += a[i];
[Figure: RTLs for this loop with register allocation applied before code motion vs. code motion before register allocation; the two instances differ only in the registers used and become identical after mapping registers]
20. Efficient Detection of Unique Function Instances
- After pruning dormant phases there may still be tens or hundreds of thousands of unique instances
- Use a CRC (cyclic redundancy check) checksum on the bytes of the RTLs representing the instructions
- Use a hash table to check whether an identical or equivalent function instance already exists in the DAG
21. Eliminating Identical/Equivalent Function Instances
- Resulting search space is a DAG of function instances
[Figure: the pruned tree collapsed into a DAG, with edges from different phase sequences converging on shared nodes]
22. Static Enumeration Results

Function             Insts  Fn instances  Len   CF    Batch vs. optimal (%)
start_input_bmp (j)  1,372  120,777       25    70    1.41
correct (i)          1,295  1,348,154     25    663   4.18
main (t)             1,276  2,882,021     29    389   16.25
parse_switches (j)   1,228  180,762       20    53    0.41
start_input_gif (j)  1,009  39,352        21    18    2.46
start_input_tga (j)  972    63,458        21    30    1.66
askmode (i)          942    232,453       24    108   7.87
skiptoword (i)       901    439,994       22    103   1.45
start_input_ppm (j)  795    8,521         16    45    2.70
Average (234)        196.2  89,946.7      14.7  36.2  6.46
23. Exhaustively enumerated the optimization phase order space to find an optimal phase ordering with respect to code size
- Published in CGO 06
24. Determining Program Performance
- Almost 175,000 distinct function instances, on average
- the largest enumerated function has 2,882,021 instances
- Too time consuming to execute each distinct function instance
- assemble, link, and execute is more expensive than compilation
- Many embedded development environments use simulation
- simulation is orders of magnitude more expensive than native execution
- Instead, use data obtained from a few executions to estimate the performance of all remaining function instances
25. Determining Program Performance (cont.)
- Function instances having identical control-flow graphs execute each block the same number of times
- Execute the application once for each control-flow structure
- Statically estimate the number of cycles required to execute each basic block
- dynamic frequency measure
- Σ (static cycles × block frequency)
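The dynamic frequency measure is simply a weighted sum over basic blocks; a sketch, assuming per-block data is available as (static cycles, frequency) pairs:

```python
def estimate_cycles(blocks):
    """Sum of (static cycle estimate x execution frequency) over all blocks."""
    return sum(cycles * freq for cycles, freq in blocks)

# e.g. a 4-cycle block executed 20 times plus a 5-cycle block executed twice
print(estimate_cycles([(4, 20), (5, 2)]))  # 90
```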
26. Predicting Relative Performance I
[Figure: two function instances with the same control flow; multiplying each block's static cycle estimate by its execution frequency and summing gives totals of 789 vs. 744 cycles, predicting that the second instance is faster]
27. Dynamic Frequency Results

Function             Insts  Fn instances  Len   CF    Leaf    Batch from optimal (%)  Worst from optimal (%)
main (t)             1,276  2,882,021     29    389   15,164  0.0                     84.3
parse_switches (j)   1,228  180,762       20    53    2,027   6.7                     64.8
askmode (i)          942    232,453       24    108   475     8.4                     56.2
skiptoword (i)       901    439,994       22    103   2,834   6.1                     49.6
start_input_ppm (j)  795    8,521         16    45    80      1.7                     28.4
pfx_list_chk (i)     640    1,269,638     44    136   4,660   4.3                     78.6
main (f)             624    2,789,903     33    122   4,214   7.5                     46.1
sha_transform (h)    541    548,812       32    98    5,262   9.6                     133.4
main (p)             483    14,510        15    10    178     7.7                     13.1
Average (79)         234.3  174,574.8     16.1  47.4  813.4   4.8                     65.4
28. Correlation: Dynamic Frequency Counts vs. Simulator Cycles
- Static performance estimation is inaccurate
- ignores cache and branch misprediction penalties
- Dynamic frequency counts may still be sufficiently accurate
- simplification of the estimation problem
- most embedded systems have simpler architectures
- We show a strong correlation between our measure of performance and simulator cycles
29. Complete Function Correlation
- Example: init_search in stringsearch
30. Leaf Function Correlation
- Leaf function instances are generated when no additional phases can be successfully applied
- Leaf instances provide a good sampling
- they represent the only code that can be generated by an aggressive compiler, like VPO
- at least one leaf instance represents an optimal phase ordering for over 86% of functions
- a significant percentage of leaf instances are among the optimal
31. Leaf Function Correlation Statistics
- Pearson's correlation coefficient:

  Pcorr = (Σxy − (Σx · Σy)/n) / sqrt((Σx² − (Σx)²/n) · (Σy² − (Σy)²/n))

- Accuracy of our estimate of optimal performance:

  Lcorr = (cycle count for best leaf) / (cycle count for leaf with best dynamic frequency count)
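Written out in Python, the two statistics are a direct transcription of the formulas above (function names are ours, not from the dissertation):

```python
from math import sqrt

def pcorr(xs, ys):
    """Pearson's correlation coefficient between two cycle-count series."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (sxy - sx * sy / n) / sqrt((sx2 - sx ** 2 / n) * (sy2 - sy ** 2 / n))

def lcorr(best_leaf_cycles, best_freq_leaf_cycles):
    """Ratio of the best leaf's cycle count to that of the leaf
    chosen by the dynamic frequency estimate."""
    return best_leaf_cycles / best_freq_leaf_cycles

# perfectly correlated series give Pcorr = 1.0
print(pcorr([1, 2, 3, 4], [2, 4, 6, 8]))
```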
32. Leaf Function Correlation Statistics (cont.)

Function         Pcorr  Lcorr 0: Ratio  Leaves  Lcorr 1: Ratio  Leaves
AR_btbl...(b)    1.00   1.00            1       1.00            1
BW_btbl...(b)    1.00   1.00            2       1.00            2
bit_count (b)    1.00   1.00            2       1.00            2
bit_shifter (b)  1.00   1.00            2       1.00            2
bitcount (b)     0.89   0.92            1       0.92            1
main (b)         1.00   1.00            6       1.00            23
ntbl_bitcnt (b)  1.00   0.95            2       0.95            2
ntbl_bit (b)     0.99   1.00            2       1.00            2
dequeue (d)      0.99   1.00            6       1.00            6
dijkstra (d)     1.00   0.97            4       1.00            269
...
average          0.96   0.98            4.38    0.996           21
33. Exhaustively evaluated the optimization phase order space to find a near-optimal phase ordering with respect to simulator cycles
- Published in LCTES 06
34. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
35. Phase Enabling Interaction
- b enables a along the path a-b-a
[Figure: DAG fragment in which phase a is dormant until b is applied, so a becomes active again along the path a-b-a]
36. Phase Enabling Probabilities
Ph St b c d g h i j k l n o q r s u
b 0.72 0.02 0.01 0.04 0.01 0.02 0.66
c 1.00 0.01 0.68 0.01 0.02 0.07 0.05 0.15 0.34
d 1.00 1.00 1.00
g 0.22 0.28 0.17 0.05 0.02 0.14 0.34 0.09 0.15
h 0.08 0.16 0.14 0.02 0.01 0.20
i 0.72 0.04 0.01 0.09
j 0.03 0.06 0.44
k 0.98 0.28 0.01 0.02 0.01 0.96
l 0.60 0.73 0.02 0.01 0.01 0.03 0.53
n 0.41 0.36 0.01 0.01 0.01 0.29
o 0.88 0.40 0.03
q 0.99 0.02 0.99
r 0.57 0.06 0.06
s 1.00 0.33 0.41 0.83 0.07 0.05 0.15 0.07
u 0.01 0.01 0.02
37. Phase Disabling Interaction
- b disables a along the path b-c-d
[Figure: DAG fragment in which applying b removes the opportunity for a along the path b-c-d]
38. Disabling Probabilities
Ph b c d g h i j k l n o q r s u
b 1.00 0.28 0.09 0.18 0.20 0.11 0.01
c 0.01 1.00 0.02 0.08 0.02 0.30 0.32 1.00 0.08
d 1.00 0.03 0.01 0.01
g 0.13 1.00 0.06 0.01 0.12 0.22
h 0.01 0.01 1.00 0.04 0.10 1.00 0.01
i 0.02 0.22 1.00 0.20 0.01 0.44 0.91
j 0.01 0.08 1.00 0.01 0.16
k 0.01 0.05 1.00 0.05 0.14 1.00
l 0.02 1.00 0.11 0.04 0.07 1.00 0.32 1.00
n 0.07 0.01 0.02 0.01 0.01 1.00 1.00 0.01
o 0.01 0.08 0.01 1.00
q 1.00
r 0.06 0.20 0.36 1.00 0.05
s 0.07 0.03 0.31 0.22 0.14 0.26 0.02 1.00
u 0.41 0.02 0.34 0.15 1.00
39. Faster Conventional Compiler
- Modified VPO to use enabling and disabling phase probabilities to decrease compilation time

  p_i  : current probability of phase i being active
  e_ij : probability of phase j enabling phase i
  d_ij : probability of phase j disabling phase i

  for each phase i do
      p_i = e_i,st
  while (any p_i > 0) do
      select j as the current phase with the highest probability of being active
      apply phase j
      if phase j was active then
          for each phase i, where i != j, do
              p_i = p_i + ((1 - p_i) * e_ij) - (p_i * d_ij)
      p_j = 0
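A runnable sketch of this loop, assuming the probability tables are given as nested dicts; apply_phase stands in for invoking the compiler and reporting whether the phase was active, and p_j is reset unconditionally here so a dormant phase is not retried until re-enabled:

```python
def probabilistic_order(start, enable, disable, apply_phase):
    """Apply phases in order of their estimated probability of being active.
    start[i]: initial probability of phase i being active;
    enable[j][i] / disable[j][i]: probability that phase j enables/disables i."""
    p = dict(start)
    while any(v > 0 for v in p.values()):
        j = max(p, key=p.get)            # phase most likely to be active
        if apply_phase(j):               # phase j was active: update the others
            for i in p:
                if i != j:
                    p[i] += (1 - p[i]) * enable[j][i] - p[i] * disable[j][i]
        p[j] = 0.0                       # do not retry j until it is re-enabled

# illustrative run: two phases, no enabling/disabling interactions
applied = []
zero = {"b": {"c": 0.0}, "c": {"b": 0.0}}
probabilistic_order({"b": 0.5, "c": 0.3}, zero, zero,
                    lambda j: applied.append(j) is None)
print(applied)  # ['b', 'c']
```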
40. Probabilistic Compilation Results

Function          Old Attempted  Old Active  Prob. Attempted  Prob. Active  Prob./Old Time  Prob./Old Size  Prob./Old Speed
start_inp...(j)   233            16          55               14            0.469           1.014           N/A
parse_swi...(j)   233            14          53               12            0.371           1.016           0.972
start_inp...(j)   270            15          55               14            0.353           1.010           N/A
start_inp...(j)   233            14          49               13            0.420           1.003           N/A
start_inp...(j)   231            11          53               12            0.436           1.004           1.000
fft_float (f)     463            28          99               25            0.451           1.012           0.974
main (f)          284            20          73               18            0.550           1.007           1.000
sha_trans...(h)   284            17          67               16            0.605           0.965           0.953
read_scan...(j)   233            13          43               10            0.342           1.018           N/A
LZWReadByte (j)   268            12          45               11            0.325           1.014           N/A
main (j)          270            12          57               14            0.375           1.007           1.000
dijkstra (d)      231            9           43               9             0.409           1.010           1.000
...
average           230.3          8.9         47.7             9.6           0.297           1.015           1.005
41. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
42. Conclusions
- Phase ordering problem
- a long-standing problem in compiler optimization
- exhaustive evaluation was always considered infeasible
- Exhaustively evaluated the phase order space
- re-interpretation of the problem
- novel application of search algorithms
- fast pruning techniques
- accurate prediction of relative performance
- Analyzed properties of the phase order space to speed up conventional compilation
- published in CGO 06 and LCTES 06; submitted to TOPLAS
43. Challenges
- Exhaustive phase order search is a severe stress test for the compiler
- isolate the analyses required and invalidated by each phase
- produce correct code for all phase orderings
- eliminate all memory leaks
- Search algorithm needs to be highly efficient
- used CRCs and hashes for function comparisons
- stored intermediate function instances to reduce disk access
- maintained logs to restart the search after a crash
44. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
45. VISTA
- Provides an interactive code improvement paradigm
- view the low-level program representation
- apply existing phases and manual changes in any order
- browse and undo previous changes
- automatically obtain performance information
- automatically search for effective phase sequences
- Useful as a research as well as a teaching tool
- employed in three universities
- published in LCTES 03 and TECS 06
46. VISTA Main Window
47. Faster Genetic Algorithm Searches
- Improving the performance of genetic algorithms
- avoid redundant executions of the application
- over 87% of executions were avoided
- reduced search time by 62%
- modify the search to obtain comparable results in fewer generations
- reduced GA generations by 59%
- reduced search time by 35%
- published in PLDI 04 and TACO 05
48. Heuristic Search Algorithms
- Analyzing the phase order space to improve heuristic algorithms
- detailed performance and cost comparison of different heuristic algorithms
- demonstrated the importance and difficulty of selecting the correct sequence length
- illustrated the importance of leaf function instances
- proposed modifications to existing algorithms, as well as new search algorithms
- published in CGO 07
49. Dynamic Compilation
- Explored asynchronous dynamic compilation in a virtual machine
- demonstrated shortcomings of the current popular compilation strategy
- described the importance of minimum compiler utilization
- discussed new compilation strategies
- explored the changes needed in current compilation strategies to exploit free cycles
- To be published in VEE 07
50. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
51. Compiler Technology Challenges
[Figure: the compiler bridging the High Level Language and the Machine Architecture]
- Support for parallelism
- traditional languages
- express parallelism
- dynamic scheduling
- Virtual machines
- dynamic code generation and optimization
- Push compilation decisions further down
- multi-core
- heterogeneous cores
- No great solution
- performance monitoring
- software-controlled reconfiguration
- Can no longer do it alone
52. Iterative Compilation & Machine Learning
- Improved scope for iterative compilation and machine learning
- proliferation of new architectures
- automated tuning of compiler heuristics
- tuning of important libraries
- use of performance monitors
- dynamic JIT compilers
- How can machine learning be used to optimize and schedule more efficiently?
53. Heterogeneous Multi-core Architectures
- Can provide the best performance/cost/power balance
- Challenges
- scheduling tasks and allocating resources
- dynamic core-specific optimization
- automatic data layout to prevent conflicts
54. Dynamic Compilation
- Virtual machines are likely to grow in importance
- productivity, portability, interoperability, isolation...
- Challenges
- when, what, and how to parallelize
- using hardware performance monitors
- using static analyses to aid dynamic compilation
- tools for correctness and performance debugging
55. Questions?
56. Results: Code Size Summary
- Exhaustively enumerated 234 out of 244 functions
- Sequence length
- Maximum: 44; Average: 14.71
- Distinct function instances
- Maximum: 2,882,021; Average: 89,947
- Distinct control flows
- Maximum: 920; Average: 36.2
- Code size improvement over the default sequence
- Maximum: 63%; Average: 6.46%
57. Results: Performance Summary
- Exhaustively evaluated 79 out of 88 functions
- Sequence length
- Maximum: 44; Average: 16.1
- Distinct function instances
- Maximum: 2,882,021; Average: 174,574.8
- Distinct control flows
- Maximum: 920; Average: 47.4
- Performance improvement over the default sequence
- Maximum: 15%; Average: 4.8%
58. Leaf vs. Non-Leaf Performance
59. Phase Order Space Evaluation Summary
[Flowchart: generate next optimization sequence → was the last phase active? (N: next sequence) → identical function instance? (Y: prune) → equivalent function instance? (Y: prune) → add node to DAG → seen control-flow structure? (N: simulate application) → calculate function performance]
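One step of this evaluation loop can be sketched as follows (a simplification: the equivalent-instance check after register remapping is folded into the same checksum test, and all names here are illustrative, not VPO's interface):

```python
import zlib

def process_instance(inst_bytes, cfg, dag, seen, seen_cfgs, simulate):
    """Handle one compiled function instance; inst_bytes is None
    when the last phase was dormant."""
    if inst_bytes is None:            # last phase not active: prune this path
        return False
    key = zlib.crc32(inst_bytes)      # identical-instance check via CRC
    if key in seen:
        return False                  # already in the DAG
    seen.add(key)
    dag.append(key)                   # add node to DAG
    if cfg not in seen_cfgs:          # first time this control flow is seen
        seen_cfgs.add(cfg)
        simulate(inst_bytes)          # simulate application once per CFG
    return True                       # caller now estimates performance

dag, seen, cfgs, sims = [], set(), set(), []
print(process_instance(b"inst1", "cfgA", dag, seen, cfgs, sims.append))  # True
print(process_instance(b"inst1", "cfgA", dag, seen, cfgs, sims.append))  # False
```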
67. Weighted Function Instance DAG
- Each node is weighted by the number of paths to a leaf node
68. Predicting Relative Performance II
[Figure: two function instances in which some block frequencies are still unknown (?); one instance's total of 170 cycles can be computed, while the other's total remains undetermined until its control-flow structure is executed]
69. Case when No Leaf is Optimal