Title: Exhaustive Phase Order Search Space Exploration and Evaluation
1. Exhaustive Phase Order Search Space Exploration and Evaluation
by Prasad Kulkarni (Florida State University)
2. Compiler Optimizations
- To improve the efficiency of compiler-generated code
- Optimization phases require enabling conditions
- need specific patterns in the code
- many also need available registers
- Phases interact with each other
- Applying optimizations in different orders generates different code
3. Phase Ordering Problem
- To find an ordering of optimization phases that produces optimal code with respect to possible phase orderings
- Evaluating each sequence involves compiling, assembling, linking, executing, and verifying results
- The best optimization phase ordering depends on
- the source application
- the target platform
- the implementation of the optimization phases
- A long-standing problem in compiler optimization!
4. Phase Ordering Space
- Current compilers incorporate numerous different optimization phases
- 15 distinct phases in our compiler backend
- 15! = 1,307,674,368,000 orderings if each phase is applied exactly once
- Phases can enable each other
- any phase can be active multiple times
- 15^15 = 437,893,890,380,859,375 sequences of length 15
- cannot restrict sequence length to 15
- 15^44 ≈ 5.598 × 10^51
5. Addressing Phase Ordering
- Exhaustive search
- universally considered intractable
- We are now able to exhaustively evaluate the optimization phase order space.
6. Re-stating of Phase Ordering
- Earlier approach
- explicitly enumerate all possible optimization phase orderings
- Our approach
- explicitly enumerate all function instances that can be produced by any combination of phases
7. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
9. Experimental Framework
- We used the VPO compilation system
- an established compiler framework, in development since 1988
- performance comparable to gcc -O2
- VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order
- Experiments use all 15 re-orderable optimization phases in VPO
- Target architecture was the StrongARM SA-100 processor
10. VPO Optimization Phases

ID  Optimization Phase              ID  Optimization Phase
b   branch chaining                 l   loop transformations
c   common subexpression elim.      n   code abstraction
d   remove unreachable code         o   evaluation order determination
g   loop unrolling                  q   strength reduction
h   dead assignment elim.           r   reverse branches
i   block reordering                s   instruction selection
j   minimize loop jumps             u   remove useless jumps
k   register allocation
11. Disclaimers
- Did not include optimization phases normally associated with compiler front ends
- no memory hierarchy optimizations
- no inlining or other interprocedural optimizations
- Did not vary how phases are applied
- Did not include optimizations that require profile data
12. Benchmarks
- 12 MiBench benchmarks (244 functions)

Category   Program       Description
auto       bitcount      test processor bit manipulation abilities
auto       qsort         sort strings using the quicksort sorting algorithm
network    dijkstra      Dijkstra's shortest path algorithm
network    patricia      construct Patricia trie for IP traffic
telecomm   fft           fast Fourier transform
telecomm   adpcm         compress 16-bit linear PCM samples to 4-bit
consumer   jpeg          image compression and decompression
consumer   tiff2bw       convert color .tiff image to black and white
security   sha           secure hash algorithm
security   blowfish      symmetric block cipher with variable-length key
office     stringsearch  search for given words in phrases
office     ispell        fast spelling checker
13. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
14. Terminology
- Active phase: an optimization phase that modifies the function representation
- Dormant phase: a phase that is unable to find any opportunity to change the function
- Function instance: any semantically, syntactically, and functionally correct representation of the source function (that can be produced by our compiler)
15. Naïve Optimization Phase Order Space
- All combinations of optimization phase sequences are attempted
[Figure: tree rooted at the unoptimized function (level L0); each node at levels L1 and L2 branches on all four phases a, b, c, d]
16. Eliminating Consecutively Applied Phases
- A phase just applied in our compiler cannot be immediately active again
[Figure: the same tree with each node's branch for its own phase removed]
17. Eliminating Dormant Phases
- Get feedback from the compiler indicating whether any transformations were successfully applied in a phase
[Figure: the tree further pruned by removing branches for dormant phases]
18. Identical Function Instances
- Some optimization phases are independent
- example: branch chaining and register allocation
- Different phase sequences can produce the same code, e.g. starting from
    r[2] = 1;
    r[3] = r[4] + r[2];
- instruction selection alone yields
    r[3] = r[4] + 1;
- constant propagation yields
    r[2] = 1;
    r[3] = r[4] + 1;
  and dead assignment elimination then yields
    r[3] = r[4] + 1;
19. Equivalent Function Instances
Source code:
    sum = 0;
    for (i = 0; i < 1000; i++)
        sum += a[i];
[Figure: RTLs for this loop with register allocation applied before code motion vs. code motion before register allocation; the two instances differ only in the registers used and become identical after mapping registers]
20. Efficient Detection of Unique Function Instances
- After pruning dormant phases there may still be tens or hundreds of thousands of unique instances
- Use a CRC (cyclic redundancy check) checksum on the bytes of the RTLs representing the instructions
- Use a hash table to check whether an identical or equivalent function instance already exists in the DAG
21. Eliminating Identical/Equivalent Function Instances
- Resulting search space is a DAG of function instances
[Figure: the pruned tree collapsed into a DAG, with edges from different phase sequences converging on shared nodes]
22. Static Enumeration Results

Function             Insts  Fn instances  Len   CF    Batch vs. optimal (%)
start_input_bmp (j)  1,372  120,777       25    70    1.41
correct (i)          1,295  1,348,154     25    663   4.18
main (t)             1,276  2,882,021     29    389   16.25
parse_switches (j)   1,228  180,762       20    53    0.41
start_input_gif (j)  1,009  39,352        21    18    2.46
start_input_tga (j)  972    63,458        21    30    1.66
askmode (i)          942    232,453       24    108   7.87
skiptoword (i)       901    439,994       22    103   1.45
start_input_ppm (j)  795    8,521         16    45    2.70
Average (234)        196.2  89,946.7      14.7  36.2  6.46
23. Exhaustively enumerated the optimization phase order space to find an optimal phase ordering with respect to code size
- Published in CGO 06
24. Determining Program Performance
- Almost 175,000 distinct function instances, on average
- the largest enumerated function has 2,882,021 instances
- Too time consuming to execute each distinct function instance
- assemble, link, and execute is more expensive than compilation
- Many embedded development environments use simulation
- simulation is orders of magnitude more expensive than native execution
- Instead, use data obtained from a few executions to estimate the performance of all remaining function instances
25. Determining Program Performance (cont.)
- Function instances having identical control-flow graphs execute each block the same number of times
- Execute the application once for each control-flow structure
- Statically estimate the number of cycles required to execute each basic block
- dynamic frequency measure
- Σ (static cycles × block frequency)
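The dynamic frequency measure is simply a weighted sum over basic blocks; a sketch, assuming per-block data is available as (static cycles, frequency) pairs:

```python
def estimate_cycles(blocks):
    """Sum of (static cycle estimate x execution frequency) over all blocks."""
    return sum(cycles * freq for cycles, freq in blocks)

# e.g. a 4-cycle block executed 20 times plus a 5-cycle block executed twice
print(estimate_cycles([(4, 20), (5, 2)]))  # 90
```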
26. Predicting Relative Performance I
[Figure: two function instances with the same control flow; multiplying each block's static cycle estimate by its execution frequency and summing gives totals of 789 vs. 744 cycles, predicting that the second instance is faster]
27. Dynamic Frequency Results

Function             Insts  Fn instances  Len   CF    Leaf    Batch from optimal (%)  Worst from optimal (%)
main (t)             1,276  2,882,021     29    389   15,164  0.0                     84.3
parse_switches (j)   1,228  180,762       20    53    2,027   6.7                     64.8
askmode (i)          942    232,453       24    108   475     8.4                     56.2
skiptoword (i)       901    439,994       22    103   2,834   6.1                     49.6
start_input_ppm (j)  795    8,521         16    45    80      1.7                     28.4
pfx_list_chk (i)     640    1,269,638     44    136   4,660   4.3                     78.6
main (f)             624    2,789,903     33    122   4,214   7.5                     46.1
sha_transform (h)    541    548,812       32    98    5,262   9.6                     133.4
main (p)             483    14,510        15    10    178     7.7                     13.1
Average (79)         234.3  174,574.8     16.1  47.4  813.4   4.8                     65.4
28. Correlation: Dynamic Frequency Counts vs. Simulator Cycles
- Static performance estimation is inaccurate
- ignores cache and branch misprediction penalties
- Dynamic frequency counts may still be sufficiently accurate
- simplification of the estimation problem
- most embedded systems have simpler architectures
- We show a strong correlation between our measure of performance and simulator cycles
29. Complete Function Correlation
- Example: init_search in stringsearch
30. Leaf Function Correlation
- Leaf function instances are generated when no additional phases can be successfully applied
- Leaf instances provide a good sampling
- they represent the only code that can be generated by an aggressive compiler, like VPO
- at least one leaf instance represents an optimal phase ordering for over 86% of functions
- a significant percentage of leaf instances are among the optimal
31. Leaf Function Correlation Statistics
- Pearson's correlation coefficient:

  Pcorr = (Σxy − (Σx · Σy)/n) / sqrt((Σx² − (Σx)²/n) · (Σy² − (Σy)²/n))

- Accuracy of our estimate of optimal performance:

  Lcorr = (cycle count for best leaf) / (cycle count for leaf with best dynamic frequency count)
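Written out in Python, the two statistics are a direct transcription of the formulas above (function names are ours, not from the dissertation):

```python
from math import sqrt

def pcorr(xs, ys):
    """Pearson's correlation coefficient between two cycle-count series."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (sxy - sx * sy / n) / sqrt((sx2 - sx ** 2 / n) * (sy2 - sy ** 2 / n))

def lcorr(best_leaf_cycles, best_freq_leaf_cycles):
    """Ratio of the best leaf's cycle count to that of the leaf
    chosen by the dynamic frequency estimate."""
    return best_leaf_cycles / best_freq_leaf_cycles

# perfectly correlated series give Pcorr = 1.0
print(pcorr([1, 2, 3, 4], [2, 4, 6, 8]))
```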
32. Leaf Function Correlation Statistics (cont.)

Function         Pcorr  Lcorr 0: Ratio  Leaves  Lcorr 1: Ratio  Leaves
AR_btbl...(b)    1.00   1.00            1       1.00            1
BW_btbl...(b)    1.00   1.00            2       1.00            2
bit_count (b)    1.00   1.00            2       1.00            2
bit_shifter (b)  1.00   1.00            2       1.00            2
bitcount (b)     0.89   0.92            1       0.92            1
main (b)         1.00   1.00            6       1.00            23
ntbl_bitcnt (b)  1.00   0.95            2       0.95            2
ntbl_bit (b)     0.99   1.00            2       1.00            2
dequeue (d)      0.99   1.00            6       1.00            6
dijkstra (d)     1.00   0.97            4       1.00            269
...
average          0.96   0.98            4.38    0.996           21
33. Exhaustively evaluated the optimization phase order space to find a near-optimal phase ordering with respect to simulator cycles
- Published in LCTES 06
34. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
35. Phase Enabling Interaction
- b enables a along the path a-b-a
[Figure: DAG fragment in which phase a is dormant until b is applied, so a becomes active again along the path a-b-a]
36. Phase Enabling Probabilities
Ph St b c d g h i j k l n o q r s u
b 0.72 0.02 0.01 0.04 0.01 0.02 0.66
c 1.00 0.01 0.68 0.01 0.02 0.07 0.05 0.15 0.34
d 1.00 1.00 1.00
g 0.22 0.28 0.17 0.05 0.02 0.14 0.34 0.09 0.15
h 0.08 0.16 0.14 0.02 0.01 0.20
i 0.72 0.04 0.01 0.09
j 0.03 0.06 0.44
k 0.98 0.28 0.01 0.02 0.01 0.96
l 0.60 0.73 0.02 0.01 0.01 0.03 0.53
n 0.41 0.36 0.01 0.01 0.01 0.29
o 0.88 0.40 0.03
q 0.99 0.02 0.99
r 0.57 0.06 0.06
s 1.00 0.33 0.41 0.83 0.07 0.05 0.15 0.07
u 0.01 0.01 0.02
37. Phase Disabling Interaction
- b disables a along the path b-c-d
[Figure: DAG fragment in which applying b removes the opportunity for a along the path b-c-d]
38. Disabling Probabilities
Ph b c d g h i j k l n o q r s u
b 1.00 0.28 0.09 0.18 0.20 0.11 0.01
c 0.01 1.00 0.02 0.08 0.02 0.30 0.32 1.00 0.08
d 1.00 0.03 0.01 0.01
g 0.13 1.00 0.06 0.01 0.12 0.22
h 0.01 0.01 1.00 0.04 0.10 1.00 0.01
i 0.02 0.22 1.00 0.20 0.01 0.44 0.91
j 0.01 0.08 1.00 0.01 0.16
k 0.01 0.05 1.00 0.05 0.14 1.00
l 0.02 1.00 0.11 0.04 0.07 1.00 0.32 1.00
n 0.07 0.01 0.02 0.01 0.01 1.00 1.00 0.01
o 0.01 0.08 0.01 1.00
q 1.00
r 0.06 0.20 0.36 1.00 0.05
s 0.07 0.03 0.31 0.22 0.14 0.26 0.02 1.00
u 0.41 0.02 0.34 0.15 1.00
39. Faster Conventional Compiler
- Modified VPO to use enabling and disabling phase probabilities to decrease compilation time

  p_i  : current probability of phase i being active
  e_ij : probability of phase j enabling phase i
  d_ij : probability of phase j disabling phase i

  for each phase i do
      p_i = e_i,st
  while (any p_i > 0) do
      select j as the current phase with the highest probability of being active
      apply phase j
      if phase j was active then
          for each phase i, where i != j, do
              p_i = p_i + ((1 - p_i) * e_ij) - (p_i * d_ij)
      p_j = 0
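A runnable sketch of this loop, assuming the probability tables are given as nested dicts; apply_phase stands in for invoking the compiler and reporting whether the phase was active, and p_j is reset unconditionally here so a dormant phase is not retried until re-enabled:

```python
def probabilistic_order(start, enable, disable, apply_phase):
    """Apply phases in order of their estimated probability of being active.
    start[i]: initial probability of phase i being active;
    enable[j][i] / disable[j][i]: probability that phase j enables/disables i."""
    p = dict(start)
    while any(v > 0 for v in p.values()):
        j = max(p, key=p.get)            # phase most likely to be active
        if apply_phase(j):               # phase j was active: update the others
            for i in p:
                if i != j:
                    p[i] += (1 - p[i]) * enable[j][i] - p[i] * disable[j][i]
        p[j] = 0.0                       # do not retry j until it is re-enabled

# illustrative run: two phases, no enabling/disabling interactions
applied = []
zero = {"b": {"c": 0.0}, "c": {"b": 0.0}}
probabilistic_order({"b": 0.5, "c": 0.3}, zero, zero,
                    lambda j: applied.append(j) is None)
print(applied)  # ['b', 'c']
```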
40. Probabilistic Compilation Results

Function          Old Attempted  Old Active  Prob. Attempted  Prob. Active  Prob./Old Time  Prob./Old Size  Prob./Old Speed
start_inp...(j)   233            16          55               14            0.469           1.014           N/A
parse_swi...(j)   233            14          53               12            0.371           1.016           0.972
start_inp...(j)   270            15          55               14            0.353           1.010           N/A
start_inp...(j)   233            14          49               13            0.420           1.003           N/A
start_inp...(j)   231            11          53               12            0.436           1.004           1.000
fft_float (f)     463            28          99               25            0.451           1.012           0.974
main (f)          284            20          73               18            0.550           1.007           1.000
sha_trans...(h)   284            17          67               16            0.605           0.965           0.953
read_scan...(j)   233            13          43               10            0.342           1.018           N/A
LZWReadByte (j)   268            12          45               11            0.325           1.014           N/A
main (j)          270            12          57               14            0.375           1.007           1.000
dijkstra (d)      231            9           43               9             0.409           1.010           1.000
...
average           230.3          8.9         47.7             9.6           0.297           1.015           1.005
41. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
42. Conclusions
- Phase ordering problem
- a long-standing problem in compiler optimization
- exhaustive evaluation was always considered infeasible
- Exhaustively evaluated the phase order space
- re-interpretation of the problem
- novel application of search algorithms
- fast pruning techniques
- accurate prediction of relative performance
- Analyzed properties of the phase order space to speed up conventional compilation
- published in CGO 06 and LCTES 06; submitted to TOPLAS
43. Challenges
- Exhaustive phase order search is a severe stress test for the compiler
- isolate the analyses required and invalidated by each phase
- produce correct code for all phase orderings
- eliminate all memory leaks
- Search algorithm needs to be highly efficient
- used CRCs and hashes for function comparisons
- stored intermediate function instances to reduce disk access
- maintained logs to restart the search after a crash
44. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
45. VISTA
- Provides an interactive code improvement paradigm
- view the low-level program representation
- apply existing phases and manual changes in any order
- browse and undo previous changes
- automatically obtain performance information
- automatically search for effective phase sequences
- Useful as a research as well as a teaching tool
- employed in three universities
- published in LCTES 03 and TECS 06
46. VISTA Main Window
47. Faster Genetic Algorithm Searches
- Improving the performance of genetic algorithms
- avoid redundant executions of the application
- over 87% of executions were avoided
- reduced search time by 62%
- modify the search to obtain comparable results in fewer generations
- reduced GA generations by 59%
- reduced search time by 35%
- published in PLDI 04 and TACO 05
48. Heuristic Search Algorithms
- Analyzing the phase order space to improve heuristic algorithms
- detailed performance and cost comparison of different heuristic algorithms
- demonstrated the importance and difficulty of selecting the correct sequence length
- illustrated the importance of leaf function instances
- proposed modifications to existing algorithms, as well as new search algorithms
- published in CGO 07
49. Dynamic Compilation
- Explored asynchronous dynamic compilation in a virtual machine
- demonstrated shortcomings of the current popular compilation strategy
- described the importance of minimum compiler utilization
- discussed new compilation strategies
- explored the changes needed in current compilation strategies to exploit free cycles
- To be published in VEE 07
50. Outline
- Experimental framework
- Exhaustive phase order space evaluation
- Faster conventional compilation
- Conclusions
- Summary of my other work
- Future research directions
51. Compiler Technology Challenges
[Figure: the compiler bridging the High Level Language and the Machine Architecture]
- Support for parallelism
- traditional languages
- express parallelism
- dynamic scheduling
- Virtual machines
- dynamic code generation and optimization
- Push compilation decisions further down
- multi-core
- heterogeneous cores
- No great solution
- performance monitoring
- software-controlled reconfiguration
- Can no longer do it alone
52. Iterative Compilation & Machine Learning
- Improved scope for iterative compilation and machine learning
- proliferation of new architectures
- automated tuning of compiler heuristics
- tuning of important libraries
- use of performance monitors
- dynamic JIT compilers
- How can machine learning be used to optimize and schedule more efficiently?
53. Heterogeneous Multi-core Architectures
- Can provide the best performance/cost/power balance
- Challenges
- scheduling tasks and allocating resources
- dynamic core-specific optimization
- automatic data layout to prevent conflicts
54. Dynamic Compilation
- Virtual machines are likely to grow in importance
- productivity, portability, interoperability, isolation...
- Challenges
- when, what, and how to parallelize
- using hardware performance monitors
- using static analyses to aid dynamic compilation
- tools for correctness and performance debugging
55. Questions?
56. Results: Code Size Summary
- Exhaustively enumerated 234 out of 244 functions
- Sequence length
- Maximum: 44; Average: 14.71
- Distinct function instances
- Maximum: 2,882,021; Average: 89,947
- Distinct control flows
- Maximum: 920; Average: 36.2
- Code size improvement over the default sequence
- Maximum: 63%; Average: 6.46%
57. Results: Performance Summary
- Exhaustively evaluated 79 out of 88 functions
- Sequence length
- Maximum: 44; Average: 16.1
- Distinct function instances
- Maximum: 2,882,021; Average: 174,574.8
- Distinct control flows
- Maximum: 920; Average: 47.4
- Performance improvement over the default sequence
- Maximum: 15%; Average: 4.8%
58. Leaf vs. Non-Leaf Performance
59. Phase Order Space Evaluation Summary
[Flowchart: generate next optimization sequence → was the last phase active? (N: next sequence) → identical function instance? (Y: prune) → equivalent function instance? (Y: prune) → add node to DAG → seen control-flow structure? (N: simulate application) → calculate function performance]
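One step of this evaluation loop can be sketched as follows (a simplification: the equivalent-instance check after register remapping is folded into the same checksum test, and all names here are illustrative, not VPO's interface):

```python
import zlib

def process_instance(inst_bytes, cfg, dag, seen, seen_cfgs, simulate):
    """Handle one compiled function instance; inst_bytes is None
    when the last phase was dormant."""
    if inst_bytes is None:            # last phase not active: prune this path
        return False
    key = zlib.crc32(inst_bytes)      # identical-instance check via CRC
    if key in seen:
        return False                  # already in the DAG
    seen.add(key)
    dag.append(key)                   # add node to DAG
    if cfg not in seen_cfgs:          # first time this control flow is seen
        seen_cfgs.add(cfg)
        simulate(inst_bytes)          # simulate application once per CFG
    return True                       # caller now estimates performance

dag, seen, cfgs, sims = [], set(), set(), []
print(process_instance(b"inst1", "cfgA", dag, seen, cfgs, sims.append))  # True
print(process_instance(b"inst1", "cfgA", dag, seen, cfgs, sims.append))  # False
```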
67. Weighted Function Instance DAG
- Each node is weighted by the number of paths to a leaf node
68. Predicting Relative Performance II
[Figure: two function instances in which some block frequencies are still unknown (?); one instance's total of 170 cycles can be computed, while the other's total remains undetermined until its control-flow structure is executed]
69. Case when No Leaf is Optimal