Exhaustive Phase Order Search Space Exploration and Evaluation - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Exhaustive Phase Order Search Space Exploration and Evaluation


1
Exhaustive Phase Order Search Space Exploration
and Evaluation
  • by
  • Prasad Kulkarni
  • (Florida State University)


2
Compiler Optimizations
  • To improve efficiency of compiler generated code
  • Optimization phases require enabling conditions
  • need specific patterns in the code
  • many also need available registers
  • Phases interact with each other
  • Applying optimizations in different orders
    generates different code

3
Phase Ordering Problem
  • To find an ordering of optimization phases that
    produces optimal code with respect to possible
    phase orderings
  • Evaluating each sequence involves compiling,
    assembling, linking, executing, and verifying
    the results
  • Best optimization phase ordering depends on
  • source application
  • target platform
  • implementation of optimization phases
  • Long-standing problem in compiler optimization!

4
Phase Ordering Space
  • Current compilers incorporate numerous different
    optimization phases
  • 15 distinct phases in our compiler backend
  • 15! = 1,307,674,368,000
  • Phases can enable each other
  • any phase can be active multiple times
  • 15^15 = 437,893,890,380,859,375
  • cannot restrict sequence length to 15
  • 15^44 ≈ 5.598 × 10^51
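As a quick sanity check on these counts (a minimal sketch; the sequence lengths 15 and 44 are the ones quoted above):

    import math

    # 15 distinct phases, each applied exactly once: 15! orderings
    print(math.factorial(15))        # 1,307,674,368,000

    # phases may repeat, sequence length fixed at 15: 15^15 orderings
    print(15 ** 15)                  # 437,893,890,380,859,375

    # longest active sequence observed in the experiments is 44 phases
    print(f"{15 ** 44:.3e}")         # ~5.598e+51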

5
Addressing Phase Ordering
  • Exhaustive Search
  • universally considered intractable
  • We are now able to exhaustively evaluate the
    optimization phase order space.

6
Re-stating of Phase Ordering
  • Earlier approach
  • explicitly enumerate all possible optimization
    phase orderings
  • Our approach
  • explicitly enumerate all function instances that
    can be produced by any combination of phases

7
Outline
  • Experimental framework
  • Exhaustive phase order space evaluation
  • Faster conventional compilation
  • Conclusions
  • Summary of my other work
  • Future research directions

8
Outline
  • Experimental framework
  • Exhaustive phase order space evaluation
  • Faster conventional compilation
  • Conclusions
  • Summary of my other work
  • Future research directions

9
Experimental Framework
  • We used the VPO compilation system
  • established compiler framework, started
    development in 1988
  • comparable performance to gcc -O2
  • VPO performs all transformations on a single
    representation (RTLs), so it is possible to
    perform most phases in an arbitrary order
  • Experiments use all 15 re-orderable
    optimization phases in VPO
  • Target architecture was the StrongARM SA-100
    processor

10
VPO Optimization Phases
ID | Optimization Phase        ID | Optimization Phase
b  | branch chaining           l  | loop transformations
c  | common subexpr. elim.     n  | code abstraction
d  | remv. unreachable code    o  | eval. order determin.
g  | loop unrolling            q  | strength reduction
h  | dead assignment elim.     r  | reverse branches
i  | block reordering          s  | instruction selection
j  | minimize loop jumps       u  | remv. useless jumps
k  | register allocation
11
Disclaimers
  • Did not include optimization phases normally
    associated with compiler front ends
  • no memory hierarchy optimizations
  • no inlining or other interprocedural
    optimizations
  • Did not vary how phases are applied
  • Did not include optimizations that require
    profile data

12
Benchmarks
  • 12 MiBench benchmarks, 244 functions in total

Category Program Description
auto bitcount test processor bit manipulation abilities
auto qsort sort strings using quicksort sorting algorithm
network dijkstra Dijkstra's shortest path algorithm
network patricia construct Patricia trie for IP traffic
telecomm fft fast Fourier transform
telecomm adpcm compress 16-bit linear PCM samples to 4-bit
consumer jpeg image compression and decompression
consumer tiff2bw convert color .tiff image to black-and-white image
security sha secure hash algorithm
security blowfish symmetric block cipher with variable length key
office stringsearch searches for given words in phrases
office ispell fast spelling checker
13
Outline
  • Experimental framework
  • Exhaustive phase order space evaluation
  • Faster conventional compilation
  • Conclusions
  • Summary of my other work
  • Future research directions

14
Terminology
  • Active phase: an optimization phase that
    modifies the function representation
  • Dormant phase: a phase that is unable to find
    any opportunity to change the function
  • Function instance: any semantically,
    syntactically, and functionally correct
    representation of the source function (that can
    be produced by our compiler)

15
Naïve Optimization Phase Order Space
  • All combinations of optimization phase sequences
    are attempted

[Figure: naive search tree over four phases a-d; at every level (L0, L1, L2, ...) each node branches into all four phases again.]
16
Eliminating Consecutively Applied Phases
  • A phase just applied in our compiler cannot be
    immediately active again

[Figure: the same tree with the just-applied phase removed from each node's successors, leaving three branches per node at levels L1 and L2.]
17
Eliminating Dormant Phases
  • Get feedback from the compiler indicating if any
    transformations were successfully applied in a
    phase.

[Figure: the tree after also pruning dormant phases; branches in which the applied phase made no change are removed.]
18
Identical Function Instances
  • Some optimization phases are independent
  • example: branch chaining and register allocation
  • Different phase sequences can produce the same
    code
      r[2] = 1;                        r[2] = 1;
      r[3] = r[4] + r[2];              r[3] = r[4] + r[2];

      instruction selection            constant propagation
      r[3] = r[4] + 1;                 r[2] = 1;
                                       r[3] = r[4] + 1;

                                       dead assignment elimination
                                       r[3] = r[4] + 1;

19
Equivalent Function Instances
Source code:
    sum = 0;
    for (i = 0; i < 1000; i++)
        sum += a[i];

Register Allocation before Code Motion:
    r[10]=0;  r[12]=HI[a];  r[12]=r[12]+LO[a];
    r[1]=r[12];  r[9]=4000+r[12];
L3: r[8]=M[r[1]];  r[10]=r[10]+r[8];  r[1]=r[1]+4;
    IC=r[1]?r[9];  PC=IC<0,L3;

Code Motion before Register Allocation:
    r[11]=0;  r[10]=HI[a];  r[10]=r[10]+LO[a];
    r[1]=r[10];  r[9]=4000+r[10];
L5: r[8]=M[r[1]];  r[11]=r[11]+r[8];  r[1]=r[1]+4;
    IC=r[1]?r[9];  PC=IC<0,L5;

After Mapping Registers (both instances):
    r[32]=0;  r[33]=HI[a];  r[33]=r[33]+LO[a];
    r[34]=r[33];  r[35]=4000+r[33];
L01: r[36]=M[r[34]];  r[32]=r[32]+r[36];  r[34]=r[34]+4;
    IC=r[34]?r[35];  PC=IC<0,L01;
20
Efficient Detection of Unique Function Instances
  • After pruning dormant phases there may be tens or
    hundreds of thousands of unique instances
  • Use a CRC (cyclic redundancy check) checksum on
    the bytes of the RTLs representing the
    instructions
  • Used a hash table to check if an identical or
    equivalent function instance already exists in
    the DAG
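A minimal sketch of this check (illustrative code, not VPO's actual implementation): the key is a CRC of the RTL text, and collisions fall back to a full comparison. For the equivalence check, the caller would first remap registers and labels to canonical names, as in the "After Mapping Registers" example above.

    import zlib

    seen = {}   # CRC checksum -> list of function-instance texts with that checksum

    def is_new_instance(rtl_text):
        """Return True if this (possibly register-remapped) instance is new."""
        key = zlib.crc32(rtl_text.encode())
        for existing in seen.get(key, []):
            if existing == rtl_text:       # already in the DAG
                return False
        seen.setdefault(key, []).append(rtl_text)
        return True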

21
Eliminating Identical/Equivalent Function
Instances
  • Resulting search space is a DAG of function
    instances

[Figure: the resulting DAG; nodes reached by different phase orders that produce identical or equivalent code are merged, so levels L0-L2 contain far fewer nodes than the original tree.]
22
Static Enumeration Results
Function             | Insts | Fn instances | Len  | CF   | % batch vs. optimal
start_input_bmp (j)  | 1,372 |      120,777 |  25  |  70  |  1.41
correct (i)          | 1,295 |    1,348,154 |  25  | 663  |  4.18
main (t)             | 1,276 |    2,882,021 |  29  | 389  | 16.25
parse_switches (j)   | 1,228 |      180,762 |  20  |  53  |  0.41
start_input_gif (j)  | 1,009 |       39,352 |  21  |  18  |  2.46
start_input_tga (j)  |   972 |       63,458 |  21  |  30  |  1.66
askmode (i)          |   942 |      232,453 |  24  | 108  |  7.87
skiptoword (i)       |   901 |      439,994 |  22  | 103  |  1.45
start_input_ppm (j)  |   795 |        8,521 |  16  |  45  |  2.70
....                 |       |              |      |      |
Average (234)        | 196.2 |     89,946.7 | 14.7 | 36.2 |  6.46
23
Exhaustively enumerated the optimization phase
order space to find an optimal phase ordering
with respect to code size
Published in CGO 06
24
Determining Program Performance
  • Almost 175,000 distinct function instances, on
    average
  • largest enumerated function has 2,882,021
    instances
  • Too time consuming to execute each distinct
    function instance
  • assemble → link → execute is more expensive than
    compilation
  • Many embedded development environments use
    simulation
  • simulation orders of magnitude more expensive
    than execution
  • Use data obtained from a few executions to
    estimate the performance of all remaining
    function instances

25
Determining Program Performance (cont...)
  • Function instances having identical control-flow
    graphs execute each block the same number of
    times
  • Execute application once for each control-flow
    structure
  • Statically estimate the number of cycles required
    to execute each basic block
  • dynamic frequency measure
  • Σ (static cycles × block frequency)
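For one function instance, that estimate might be computed as below (a minimal sketch; the block names, cycle estimates, and frequencies are illustrative):

    def dynamic_frequency_measure(static_cycles, block_freq):
        # sum over basic blocks of (statically estimated cycles * measured frequency)
        return sum(static_cycles[b] * block_freq[b] for b in static_cycles)

    # frequencies come from one execution of this control-flow structure
    print(dynamic_frequency_measure({"B1": 4, "B2": 27, "B3": 2},
                                    {"B1": 20, "B2": 5, "B3": 2}))   # 219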

26
Predicting Relative Performance I
[Figure: two function instances with the same control-flow graph; each block is annotated with its execution frequency (20, 5, 15, 2, 5, 20) and its statically estimated cycles, giving estimated totals of 789 cycles for one instance and 744 for the other.]
27
Dynamic Frequency Results
Function            | Insts | Fn instances | Len  | CF   | Leaves | % from optimal (batch) | % from optimal (worst)
main (t)            | 1,276 |    2,882,021 |  29  |  389 | 15,164 |  0.0                   |  84.3
parse_switches (j)  | 1,228 |      180,762 |  20  |   53 |  2,027 |  6.7                   |  64.8
askmode (i)         |   942 |      232,453 |  24  |  108 |    475 |  8.4                   |  56.2
skiptoword (i)      |   901 |      439,994 |  22  |  103 |  2,834 |  6.1                   |  49.6
start_input_ppm (j) |   795 |        8,521 |  16  |   45 |     80 |  1.7                   |  28.4
pfx_list_chk (i)    |   640 |    1,269,638 |  44  |  136 |  4,660 |  4.3                   |  78.6
main (f)            |   624 |    2,789,903 |  33  |  122 |  4,214 |  7.5                   |  46.1
sha_transform (h)   |   541 |      548,812 |  32  |   98 |  5,262 |  9.6                   | 133.4
main (p)            |   483 |       14,510 |  15  |   10 |    178 |  7.7                   |  13.1
....                |       |              |      |      |        |                        |
Average (79)        | 234.3 |    174,574.8 | 16.1 | 47.4 |  813.4 |  4.8                   |  65.4
28
Correlation: Dynamic Frequency Counts vs. Simulator Cycles
  • Static performance estimation is inaccurate
  • ignored cache/branch misprediction penalties
  • Dynamic frequency counts may be sufficiently
    accurate
  • simplification of the estimation problem
  • most embedded systems have simpler architectures
  • We show strong correlation between our measure of
    performance and simulator cycles

29
Complete Function Correlation
  • Example: init_search in stringsearch

30
Leaf Function Correlation
  • Leaf function instances are generated when no
    additional phases can be successfully applied
  • Leaf instances provide a good sampling
  • represents the only code that can be generated by
    an aggressive compiler, like VPO
  • at least one leaf instance represents an optimal
    phase ordering for over 86% of functions
  • a significant percentage of leaf instances are
    among the optimal

31
Leaf Function Correlation Statistics
  • Pearson's correlation coefficient
  • Accuracy of our estimate of optimal perf.

  Pcorr = [ Σxy − (Σx·Σy)/n ] /
          sqrt( (Σx² − (Σx)²/n) · (Σy² − (Σy)²/n) )

  Lcorr = (cycle count for best leaf) /
          (cycle count for leaf with best dynamic freq count)
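A direct transcription of Pcorr (a sketch; here x would be the dynamic frequency estimates and y the simulator cycle counts of the same function instances):

    from math import sqrt

    def pearson(x, y):
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
        return (sxy - sx * sy / n) / sqrt((sx2 - sx**2 / n) * (sy2 - sy**2 / n))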
32
Leaf Function Correlation Statistics (cont)
                 |       |     Lcorr 0      |     Lcorr 1
Function         | Pcorr | Ratio | Leaves   | Ratio | Leaves
AR_btbl...(b)    | 1.00  | 1.00  |   1      | 1.00  |   1
BW_btbl...(b)    | 1.00  | 1.00  |   2      | 1.00  |   2
bit_count.(b)    | 1.00  | 1.00  |   2      | 1.00  |   2
bit_shifter(b)   | 1.00  | 1.00  |   2      | 1.00  |   2
bitcount(b)      | 0.89  | 0.92  |   1      | 0.92  |   1
main(b)          | 1.00  | 1.00  |   6      | 1.00  |  23
ntbl_bitcnt(b)   | 1.00  | 0.95  |   2      | 0.95  |   2
ntbl_bit(b)      | 0.99  | 1.00  |   2      | 1.00  |   2
dequeue(d)       | 0.99  | 1.00  |   6      | 1.00  |   6
dijkstra(d)      | 1.00  | 0.97  |   4      | 1.00  | 269
....             |       |       |          |       |
average          | 0.96  | 0.98  |  4.38    | 0.996 |  21
33
Exhaustively evaluated the optimization phase
order space to find a near-optimal phase
ordering with respect to simulator cycles
Published in LCTES 06
34
Outline
  • Experimental framework
  • Exhaustive phase order space evaluation
  • Faster conventional compilation
  • Conclusions
  • Summary of my other work
  • Future research directions

35
Phase Enabling Interaction
  • b enables a along the path a-b-a

[Figure: a fragment of the search DAG over phases a-d; along the path a-b-a, applying b creates a new opportunity for a, i.e., b enables a.]
36
Phase Enabling Probabilities
Ph St b c d g h i j k l n o q r s u

b 0.72 0.02 0.01 0.04 0.01 0.02 0.66
c 1.00 0.01 0.68 0.01 0.02 0.07 0.05 0.15 0.34
d 1.00 1.00 1.00
g 0.22 0.28 0.17 0.05 0.02 0.14 0.34 0.09 0.15
h 0.08 0.16 0.14 0.02 0.01 0.20
i 0.72 0.04 0.01 0.09
j 0.03 0.06 0.44
k 0.98 0.28 0.01 0.02 0.01 0.96
l 0.60 0.73 0.02 0.01 0.01 0.03 0.53
n 0.41 0.36 0.01 0.01 0.01 0.29
o 0.88 0.40 0.03
q 0.99 0.02 0.99
r 0.57 0.06 0.06
s 1.00 0.33 0.41 0.83 0.07 0.05 0.15 0.07
u 0.01 0.01 0.02
37
Phase Disabling Interaction
  • b disables a along the path b-c-d

[Figure: a fragment of the search DAG over phases a-d; along the path b-c-d, applying b removes the opportunity for a that existed on the other paths, i.e., b disables a.]
38
Disabling Probabilities
Ph b c d g h i j k l n o q r s u

b 1.00 0.28 0.09 0.18 0.20 0.11 0.01
c 0.01 1.00 0.02 0.08 0.02 0.30 0.32 1.00 0.08
d 1.00 0.03 0.01 0.01
g 0.13 1.00 0.06 0.01 0.12 0.22
h 0.01 0.01 1.00 0.04 0.10 1.00 0.01
i 0.02 0.22 1.00 0.20 0.01 0.44 0.91
j 0.01 0.08 1.00 0.01 0.16
k 0.01 0.05 1.00 0.05 0.14 1.00
l 0.02 1.00 0.11 0.04 0.07 1.00 0.32 1.00
n 0.07 0.01 0.02 0.01 0.01 1.00 1.00 0.01
o 0.01 0.08 0.01 1.00
q 1.00
r 0.06 0.20 0.36 1.00 0.05
s 0.07 0.03 0.31 0.22 0.14 0.26 0.02 1.00
u 0.41 0.02 0.34 0.15 1.00
39
Faster Conventional Compiler
  • Modified VPO to use enabling and disabling phase
    probabilities to decrease compilation time

  pi     - current probability of phase i being active
  ei^st  - probability of phase i being active on the
           initial (unoptimized) function
  eij    - probability of phase j enabling phase i
  dij    - probability of phase j disabling phase i

  for each phase i do
      pi = ei^st
  while (any pi > 0) do
      select j as the current phase with the highest
          probability of being active
      apply phase j
      if phase j was active then
          for each phase i, where i != j do
              pi = pi + ((1 - pi) * eij) - (pi * dij)
      pj = 0
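The same loop in executable form (a sketch; the probability tables and the apply_phase callback are supplied by the caller, and the names are illustrative rather than VPO's own):

    def probabilistic_compile(phases, e_start, e, d, apply_phase, threshold=0.0):
        # p[i]: current probability that phase i would be active if applied now
        p = {i: e_start[i] for i in phases}
        while any(pi > threshold for pi in p.values()):
            j = max(p, key=p.get)            # phase most likely to be active
            was_active = apply_phase(j)      # True if phase j changed the function
            if was_active:
                for i in phases:
                    if i != j:
                        # j may have enabled (if dormant) or disabled (if pending) phase i
                        p[i] = p[i] + (1 - p[i]) * e[i][j] - p[i] * d[i][j]
            p[j] = 0.0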
40
Probabilistic Compilation Results
                   | Old compilation    | Prob. compilation  | Prob. / Old
Function           | Attempted | Active | Attempted | Active | Time  | Size  | Speed
start_inp...(j)    |   233     |  16    |    55     |  14    | 0.469 | 1.014 |  N/A
parse_swi...(j)    |   233     |  14    |    53     |  12    | 0.371 | 1.016 | 0.972
start_inp...(j)    |   270     |  15    |    55     |  14    | 0.353 | 1.010 |  N/A
start_inp...(j)    |   233     |  14    |    49     |  13    | 0.420 | 1.003 |  N/A
start_inp...(j)    |   231     |  11    |    53     |  12    | 0.436 | 1.004 | 1.000
fft_float(f)       |   463     |  28    |    99     |  25    | 0.451 | 1.012 | 0.974
main(f)            |   284     |  20    |    73     |  18    | 0.550 | 1.007 | 1.000
sha_trans...(h)    |   284     |  17    |    67     |  16    | 0.605 | 0.965 | 0.953
read_scan...(j)    |   233     |  13    |    43     |  10    | 0.342 | 1.018 |  N/A
LZWReadByte(j)     |   268     |  12    |    45     |  11    | 0.325 | 1.014 |  N/A
main(j)            |   270     |  12    |    57     |  14    | 0.375 | 1.007 | 1.000
dijkstra(d)        |   231     |   9    |    43     |   9    | 0.409 | 1.010 | 1.000
....               |           |        |           |        |       |       |
average            |  230.3    |  8.9   |   47.7    |  9.6   | 0.297 | 1.015 | 1.005
41
Outline
  • Experimental framework
  • Exhaustive phase order space evaluation
  • Faster conventional compilation
  • Conclusions
  • Summary of my other work
  • Future research directions

42
Conclusions
  • Phase ordering problem
  • long standing problem in compiler optimization
  • exhaustive evaluation always considered
    infeasible
  • Exhaustively evaluated the phase order space
  • re-interpretation of the problem
  • novel application of search algorithms
  • fast pruning techniques
  • accurate prediction of relative performance
  • Analyzed properties of the phase order space to
    speed up conventional compilation
  • published in CGO06, LCTES06, submitted to TOPLAS

43
Challenges
  • Exhaustive phase order search is a severe stress
    test for the compiler
  • isolate analysis required and invalidated by each
    phase
  • produce correct code for all phase orderings
  • eliminate all memory leaks
  • Search algorithm needs to be highly efficient
  • used CRCs and hashes for function comparisons
  • stored intermediate function instances to reduce
    disk access
  • maintained logs to restart search after crash

44
Outline
  • Experimental framework
  • Exhaustive phase order space evaluation
  • Faster conventional compilation
  • Conclusions
  • Summary of my other work
  • Future research directions

45
VISTA
  • Provides an interactive code improvement paradigm
  • view low-level program representation
  • apply existing phases and manual changes in any
    order
  • browse and undo previous changes
  • automatically obtain performance information
  • automatically search for effective phase
    sequences
  • Useful as a research as well as a teaching tool
  • used at three universities
  • published in LCTES 03, TECS 06

46
VISTA Main Window
47
Faster Genetic Algorithm Searches
  • Improving performance of genetic algorithms
  • avoid redundant executions of the application
  • over 87% of executions were avoided
  • reduce search time by 62%
  • modify search to obtain comparable results in
    fewer generations
  • reduced GA generations by 59%
  • reduce search time by 35%
  • published in PLDI 04, TACO 05

48
Heuristic Search Algorithms
  • Analyzing the phase order space to improve
    heuristic algorithms
  • detailed performance and cost comparison of
    different heuristic algorithms
  • demonstrated the importance and difficulty of
    selecting the correct sequence length
  • illustrated the importance of leaf function
    instances
  • proposed modifications to existing algorithms,
    and new search algorithms
  • published in CGO 07

49
Dynamic Compilation
  • Explored asynchronous dynamic compilation in a
    virtual machine
  • demonstrated shortcomings of current popular
    compilation strategy
  • described the importance of minimum compiler
    utilization
  • discussed new compilation strategies
  • explored the changes needed to current
    compilation strategies to exploit free cycles
  • Will be published in VEE 07

50
Outline
  • Experimental framework
  • Exhaustive phase order space evaluation
  • Faster conventional compilation
  • Conclusions
  • Summary of my other work
  • Future research directions

51
Compiler Technology Challenges
[Figure: the compiler sits between the high-level language and the machine architecture.]
  • Support for parallelism
  • traditional languages
  • express parallelism
  • dynamic scheduling
  • Virtual machines
  • dynamic code generation and optimization
  • Push compilation decisions further down
  • multi-core
  • heterogeneous cores
  • No great solution
  • performance monitoring
  • software-controlled reconfiguration
  • Can no longer do it alone

52
Iterative Compilation and Machine Learning
  • Improved scope for iterative compilation and
    machine learning
  • proliferation of new architectures
  • automate tuning compiler heuristics
  • tuning important libraries
  • using performance monitors
  • dynamic JIT compilers
  • How to use machine learning to optimize and
    schedule more efficiently?

53
Heterogeneous Multi-core Architectures
  • Can provide the best performance, cost, power
    balance
  • Challenges
  • schedule tasks, allocate resources
  • dynamic core-specific optimization
  • automatic data layout to prevent conflicts

54
Dynamic Compilation
  • Virtual machines likely to grow in importance
  • productivity, portability, interoperability,
    isolation...
  • Challenges
  • when, what, how to parallelize
  • using hardware performance monitors
  • using static analyses to aid dynamic compilation
  • debugging tools for correctness and performance
    debugging

55
Questions?
56
Results Code Size Summary
  • Exhaustively enumerated 234 out of 244 functions
  • Sequence length
  • Maximum 44 Average 14.71
  • Distinct function instances
  • Maximum 2,882,021 Average 89,947
  • Distinct control-flows
  • Maximum 920 Average 36.2
  • Code size improvement over default sequence
  • Maximum 63%  Average 6.46%

57
Results Performance Summary
  • Exhaustively evaluated 79 out of 88 functions
  • Sequence length
  • Maximum 44 Average 16.1
  • Distinct function instances
  • Maximum 2,882,021 Average 174,574.8
  • Distinct control-flows
  • Maximum 920 Average 47.4
  • Performance improvement over default sequence
  • Maximum 15%  Average 4.8%

58
Leaf Vs. Non-Leaf Performance
59
Phase Order Space Evaluation Summary
  • generate next optimization sequence
  • last phase active?  (no: generate the next sequence)
  • identical function instance?  (yes: generate the next sequence)
  • equivalent function instance?  (yes: generate the next sequence)
  • add node to DAG
  • seen this control-flow structure?  (no: simulate the application)
  • calculate function performance, then generate the next sequence
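A sketch of this driver loop (illustrative callback names; the real driver is the modified VPO compiler described earlier):

    def evaluate_phase_order_space(root, phases, apply_phase, same_instance,
                                   seen_control_flow, simulate, estimate_cycles):
        # root: unoptimized function instance
        # apply_phase(inst, ph) -> new instance, or None if the phase was dormant
        dag_nodes = [root]
        worklist = [root]
        while worklist:
            node = worklist.pop()
            for ph in phases:                      # generate next optimization sequence
                inst = apply_phase(node, ph)
                if inst is None:                   # dormant phase: prune this branch
                    continue
                if any(same_instance(inst, n) for n in dag_nodes):
                    continue                       # identical/equivalent node already in the DAG
                dag_nodes.append(inst)             # add node to DAG
                if not seen_control_flow(inst):
                    simulate(inst)                 # one simulation per distinct control flow
                estimate_cycles(inst)              # dynamic-frequency performance estimate
                worklist.append(inst)
        return dag_nodes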
67
Weighted Function Instance DAG
  • Each node is weighted by the number of paths to a
    leaf node
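The weight of every node can be computed in one bottom-up pass over the DAG (a minimal sketch; the nodes and successor lists are illustrative):

    from functools import lru_cache

    def leaf_path_counts(successors):
        """successors maps each node to the nodes reachable by one more active phase."""
        @lru_cache(maxsize=None)
        def count(node):
            kids = successors.get(node, [])
            if not kids:                 # a leaf: no phase can change the function further
                return 1
            return sum(count(k) for k in kids)
        return {node: count(node) for node in successors}

    # e.g. leaf_path_counts({"root": ["A", "B"], "A": ["C"], "B": ["C"], "C": []})
    #      -> {"root": 2, "A": 1, "B": 1, "C": 1}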

68
Predicting Relative Performance II
[Figure: predicting performance for an instance with a different control-flow structure; some block frequencies are unknown ('?'), so its total cycle count cannot yet be computed, whereas the instance with known frequencies totals 170 cycles.]
69
Case when No Leaf is Optimal